1 Introduction

Wiki is the simplest online database that could possibly work [41]. It usually
takes the form of a website or a webpage where the presentation is predefined to
some extent, but the content can be edited by a subset of users. The editing
ideally does not require any additional software or extra knowledge, takes place
in a browser and utilises a simple notation for markup. Currently there are more
than a hundred such notations, varying slightly in concrete syntax but mostly
providing the same set of features for emphasizing fragments of text, making
tables, inserting images, etc. [10]. The most popular notation of all is that of
the MediaWiki engine: it is used on Wikipedia, Wikia and numerous Wikimedia
Foundation projects.

In order to facilitate development of new wikiware and to simplify maintenance
of existing wikiware, one can rely on methods and tools from software language
engineering. It is a field that emerged in recent years, generalising
theoretical and practical aspects of programming languages, markup languages,
modelling languages, data definition languages, transformation languages, query
languages, application programming interfaces, software libraries, etc.
[15, 23, 25, 70], and is believed to be the successor to the object-oriented
paradigm [14]. The main focus of software language engineering is on the
disciplined creation of new domain-specific languages, with emphasis on
extensive automation. Practice shows that automated software maintenance,
analysis, migration and renovation deliver considerable benefits in terms of
costs and human effort compared to the alternatives (manual changes, legacy
rebuild, etc.), especially on a large scale [11, 61, 65]. However, automated
methods do require a special foundation for their successful usage.

Wikiware (wiki engines, parsers, bots, etc.) is a specific case of grammarware
(parsers, compilers, browsers, pretty-printers, analysis and manipulation
tools, etc.) [25, 75]. Grammarware can be most straightforwardly defined as
software whose input and/or output must belong to a certain language (i.e., can
be specified implicitly or explicitly by a formal grammar). An operational
grammar is needed to parse the code, to get it from the textual form that the
programmers created into a specialised generational and transformational
infrastructure that usually utilises a tree-like internal format. In spite of
the fact that formal grammar theory has been quite an established area since
1956 [9], the grammars of mainstream programming languages are rarely freely
obtainable: they are complex artefacts that are seen as valuable IT assets,
require considerable effort and expertise to compose, and therefore are not
always readily disclosed to the public by those who develop, maintain and
reverse engineer them. A syntactic grammar is basically a mere formal
description of what can and what cannot be considered valid in a language. The
most obvious sources for this kind of information are language documentation,
grammarware source code, international standards, protocol definitions, etc.

However, documentation and specifications are never complete or error-free
[79]. To obtain correct grammars and ensure their quality level, special
techniques are needed: grammar adaptation [32], grammar recovery [36], grammar
engineering [25], grammar derivation [27], grammar reverse engineering,
grammar re-engineering, grammar archaeology [34], grammar extraction [75,
§5.4], grammar convergence [37], grammar relationship recovery [39], grammar
testing [33], grammar inference [64], grammar correction [75, §5.7],
programmable grammar transformation [74], and so on. The current document
is mainly a demonstration of the application of such techniques to the
MediaWiki BNF grammar that was published as [47, 46, 51, 52, 49, 50, 48, 44].

1.1 Objectives 

The project reported in this document aims at extraction and initial recovery 
of the MediaWiki grammar. However, the extracted grammar is not the final 
goal, but rather a stepping stone to enable the following activities: 

Parse wiki pages. The current state of Wikipedia is based on a PHP rewriting
system that transforms wiki layout directly into HTML [53]. However,
it cannot always be utilised in other external wikiware: for example,
future plans of the Wikimedia Foundation include having an in-browser editor
with a WYSIWYG front-end in JavaScript [69]. Having an operational
grammar means anyone can parse wiki pages more freely with their own
technology of choice, either directly or by deriving tolerant grammars from
the baseline grammar [27].

Aid wiki migration. The ability to easily parse and transform wiki pages 
can deliver considerable benefits when migrating wiki content from one 
platform to another [78]. 

Validate existing wiki pages. The current state of MediaWiki parser [53] 
allows users to submit wiki pages that are essentially incorrect: they may 
combine wiki notation with bare HTML, contain unbalanced markup, refer 
to nonexistent templates. This positively affects the user-friendliness of 
the wiki, but makes some wiki pages possibly problematic. Such pages 
can be identified and repaired with static code analysis techniques [7]. 

Test existing wiki parsers. There is considerable prior research in the field
of grammar-based testing, both stochastic [43, 60] and combinatorial [19,
33, 42, 56, 72], with important recent advances in formulating coverage
criteria and achieving automation [16, 35]. These results can be easily
reproduced to provide an extensive test data suite containing different
wiki text fragments to explore every detail specified by the grammar in a
fully automated fashion. Such test data suites can be used to determine
existing parsers' conformance, can help in developing new parsers, find
problematic combinations that are treated differently by different parsers,
etc.

Improve grammar readability. A grammar is meant both to define the language
for the computer to parse and to describe it for language engineers to
understand. However, these two goals usually conflict, and more often than
not one opts for an executable grammar that is harder to read rather than for
a perfectly readable one that cannot be used in constructing grammarware.
Unfortunately, considerable effort and expertise are needed to fully achieve
either of them, and most language documents contain non-operational grammars
[28, 72, 79]. The practice of using two grammars, the "more readable" one and
the "more implementable" one, adopted in the Java specification [18], has also
proven to be very ineffective and error-prone [38, 39].

Perform automated adaptation. Grammars commonly need to be adapted
in order to be useful and efficient in a wide range of circumstances [32].
Grammar transformation frameworks such as GDK [30], GRK [34] or
XBGF [74] can be used to apply adapting transformations in a safe, disciplined
way, with validation of applicability preconditions and full control
over the language delta. In fact, some of the transformations can even be
generated automatically and applied afterwards.

Establish inter-grammar relationships. As of today, several MediaWiki 
notation grammars exist and are available in one form or another: in 
EBNF [76], in ANTLR [5], etc (none of them are fully operational). 
Furthermore, there exist various other wiki notations: Creole [22], Wiki- 
dot [68], etc. Relationships among all these notations are unknown: they 
are implicit even when formal grammars actually exist, and are totally 
obscured when the notation is only documented in a manual. A spe- 
cial technique called language convergence can help to reengineer such 
relationships in order to make stronger claims about compatibility and 
expressivity [37, 75, 77]. 

1.2 Related work: grammar recovery initiatives 

Most operational grammars for mainstream software languages are handcrafted,
many are not publicly disclosed, and few are documented. The first grammar
recovery case reported in detail, in 1998, was PLEX (Programming Language for
Exchanges), a proprietary DSL for real time embedded software systems by
Ericsson [59]; a successful application of the same technology to COBOL
followed [62]. Grammar recovery techniques are not only needed for legacy
languages: examples of more modern and presumably more accurately engineered
grammars being non-trivially extracted include C# in [73] and [75, §3] and
Java in [38] and [39]. The whole process of MediaWiki grammar extraction is
documented by this report; all corrections and refactorings are available
online, as is the end result (under the CC-BY-SA license).

1.3 Related work: Wiki Creole 

Wiki Creole 1.0 is an attempt at engineering an ideal wiki syntax and a formal
grammar for it. While the goal of specifying the wiki syntax with a grammar is
not foreign to us, the benefits listed in [22, p. 3] are highly questionable:

1. Trivial parser construction. In the paper cited above it is claimed that
applying a parser generator is trivial. However, the main prerequisite for
it is successful grammar adaptation for the particular parsing technology
[32]. The Wiki Creole grammar was specifically geared toward ANTLR,
and it is a highly sophisticated task to migrate it anywhere if at some
point ANTLR use is deemed to be undesirable. Hence, the result is not
reproducible without considerable effort and expertise.

2. Foundation for subsequent semantics specification. The grammar
can certainly serve as a basis for specifying semantics. However, the
choice of a suitable calculus for such semantics specification is of even
more importance. Furthermore, syntax definition does not guarantee the
absence of ambiguities in semantics, or even changes of semantics as a
part of language evolution (cf. evolutionary changes of HTML elements).

3. Improved communication between wikiware developers. The paper
claimed that if wiki syntax is specified with a grammar, there can be no
different interpretations of it. However, it is quite common to have
different interpretations (dialects) of even mainstream programming languages,
plus wiki technology in its current state heavily relies on fault tolerance
(somewhat less so in the future when no bare text editing should be taking
place).

4. Same rendering behaviour that users rely on. Depending on the
browser or the particular gadget that the end user deploys to access the
wiki, rendering behaviour can be vastly different, and this has nothing to
do with the syntax specification.

5. Simplified syntax extension. It is a well-known fact in formal grammar
theory [1] that grammar classes are not compositional: that is, the result
of combining two LL(*) grammars (which ANTLR uses) does not necessarily
belong to the LL(*) class; we can only prove that it will still be
context-free [9]. In other words, it is indeed easy to specify a syntax
extension, but the extended grammar sometimes will not be operational.
Modular grammars can be deployed in frameworks which use different
parsing technologies, such as in Meta-Environment [24] or in Rascal [26]
or in MPS [66], but not in ANTLR.

6. Performance predictions. The paper claims that it is easier to predict
performance of a parser made with "well-understood language theory"
than with a parser based on regular expressions. However, there are
implementation algorithms of regular expressions that demonstrate quadratic
behaviour [12], and ANTLR uses the same technology for matching lookahead
anyway, which immediately means that their performance is the
same.

7. Discovering ambiguities. It is true that ambiguity analysis is easier on a
formal grammar than on the prose, but it is not achieved by a "more rigorous
specification mechanism" and even the most advanced techniques of today
do not always succeed [4].

8. Well-defined interchange format. A well-designed interchange format
between different types of wikiware is a separate effort that should be
based on appropriate generalisations of many previously existing wiki
notations, not on one artificially created one, even if that one is better
designed.

In general, the Wiki Creole initiative is relevant for us because it can serve
as a common grammar denominator later to converge several wiki grammars
[37, 77], but it is neither contributing to nor conflicting directly with our
grammar recovery project.

2 Grammar notation 

One of the first steps in grammar extraction is understanding the grammar 
definition formalism (i.e., the notation) used in the original artefact to describe 
the language. In the case of MediaWiki, Backus-Naur form is claimed to be 
used [45]. Manual cursory examination of the grammar text [47, 46, 51, 52, 49, 
50, 48, 44] allows us to identify the following metasymbols in the spirit of [20] 
and [75]: 



Name                            Value
Start grammar symbol            <source lang=bnf>
End grammar symbol              </source>
Start comment symbol            /*
End comment symbol              */
Defining symbol                 ::=
Definition separator symbol     |
Start nonterminal symbol        <
End nonterminal symbol          >
Start terminal symbol           "
End terminal symbol             "
Start option symbol             [
End option symbol               ]
Start group symbol              (
End group symbol                )
Start repetition star symbol    {
End repetition star symbol      }
Start repetition plus symbol    {
End repetition plus symbol      }+



As we know from [3] and its research in [75, §6.3], BNF was originally defined 
as follows: 



Name                            Value
Defining symbol                 :≡
Definition separator symbol     or
Terminator symbol               ↵
Start nonterminal symbol        <
End nonterminal symbol          >



While the difference in the appearances of defining symbols is minor and 
is commonly overlooked, there are several properties of the notation used for 
MediaWiki grammar definition that place it well outside BNF, namely: 

• Using delimiters to explicitly denote terminal symbols (instead of using 
underlined decoration for keywords and relying on implicit assumptions 
for non-alphanumeric characters). 

• Presence of comments in the grammar (not in the text around it). 

• Allowing inconsistent terminator symbol (i.e., a newline or a double new- 
line, sometimes a semicolon). 

• Having metalanguage symbols for marking optional parts of productions. 

• Having metalanguage symbols for marking repeated parts of productions. 

• Having metalanguage symbols for grouping parts of productions. 



Hence, it is not BNF. For the sake of completeness, let us compare it to 
the classic EBNF, originally proposed in [71] (sometimes that dialect is referred 
to as Wirth Syntax Notation) and standardised much later by ISO as [20]: 



Name                            Value in WSN    Value in ISO EBNF
Concatenate symbol                              ,
Start comment symbol                            (*
End comment symbol                              *)
Defining symbol                 =               =
Definition separator symbol     |               |
Terminator symbol               .               ;
Start terminal symbol           "               "
End terminal symbol             "               "
Start option symbol             [               [
End option symbol               ]               ]
Start group symbol              (               (
End group symbol                )               )
Start repetition star symbol    {               {
End repetition star symbol      }               }
Exception symbol                                -
Postfix repetition symbol                       *



We notice again a list of differences of MediaWiki grammar notation versus 
WSN and ISO EBNF: 

• Allowing inconsistent terminator symbol (i.e., a newline or a double new- 
line). 

• Presence of comments (consistent only with ISO EBNF). 

• Lack of concatenate metasymbol (consistent only with WSN).

• Not having a metalanguage symbol for exceptions (consistent only with WSN).

• Not having a specially designated postfix symbol for denoting repetition 
(consistent only with WSN). 

Hence, the notation adopted by the MediaWiki grammar is neither BNF nor
EBNF, but an extension of a subset of EBNF. Since we cannot reuse any
previously existing automated grammar extractor, we define this particular
notation with EDD (EBNF Dialect Definition), a part of SLPS (Software Language
Processing Suite) [80], and use Grammar Hunter, a universal configurable
grammar extraction tool, for extracting the first version. The definition
itself is a straightforward XML-ification of the first table of this section,
so we leave it out of this document. The only addition is switching on the
options of disregarding extra spaces and extra newlines that are left after
tokenising the grammar. The EDD is freely available for re-use in the
subversion repository of SLPS^2.

^2 Available as config.edd.
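
To make the idea of a notation-driven extractor concrete, here is a minimal sketch. It is not Grammar Hunter and does not read the real EDD format (which is XML): the dictionary layout, the key names and the function name are assumptions of this sketch, and it only handles productions that fit on one line and whose terminals do not contain the separator symbol.

```python
import re

# Metasymbols taken from the first table of this section. Encoding them as a
# plain dictionary is an assumption of this sketch; the real EDD is XML.
NOTATION = {
    "start_grammar": "<source lang=bnf>",
    "end_grammar": "</source>",
    "start_comment": "/*",
    "end_comment": "*/",
    "defining": "::=",
    "separator": "|",
}


def extract(page, n=NOTATION):
    """Return {nonterminal: [alternative, ...]} for one-line productions."""
    grammar = {}
    start, end = re.escape(n["start_grammar"]), re.escape(n["end_grammar"])
    for fragment in re.findall(start + r"(.*?)" + end, page, flags=re.S):
        # Drop comments that live inside the grammar fragment.
        comment = (re.escape(n["start_comment"]) + r".*?"
                   + re.escape(n["end_comment"]))
        fragment = re.sub(comment, "", fragment, flags=re.S)
        for line in fragment.splitlines():
            if n["defining"] not in line:
                continue
            lhs, rhs = line.split(n["defining"], 1)
            # Disregard extra spaces, in the spirit of the options above.
            alts = [" ".join(a.split()) for a in rhs.split(n["separator"])]
            grammar.setdefault(lhs.strip(), []).extend(alts)
    return grammar
```

Swapping in a different dictionary (for instance with "=" as the defining symbol) is all it takes to retarget this toy extractor to another dialect, which is the design idea behind keeping the notation definition separate from the extraction tool.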



3 Guided grammar extraction 

Since the grammar extraction process is performed for this particular notation
for the first time, we use guided extraction, in which the results of the
extraction are visually compared to the original text by an expert in grammar
engineering. This document is a detailed explanation of observations collected
in that process and actions undertaken to resolve the spotted issues.

Given previous experience, it is safe to assume that once the grammar is
extracted, we would like to change some parts of it (for grammar adaptation [32],
deyaccification [58] and other activities common for grammar recovery [34]). In
order for those changes to stay fully traceable and transparent, we will take
the approach of programmable grammar transformation. In this methodology,
we take a baseline grammar and an operator suite, and by choosing the right
operators and parametrising them, we program the desired changes in the same
way mainstream programmers use programming languages to create software.
These transformation scripts are executable with the grammar transformation
engine. Any meta-programming facility would suffice; for this particular work
we use XBGF [74], which was shown in [39] to be the best and the most versatile
grammar transformation infrastructure at this moment. The tools of SLPS
that surround XBGF also allow for easy publishing by providing immediate
possibilities to transform XBGF scripts to LaTeX or XHTML.

3.1 Source for extraction 

The grammar of MediaWiki is available on subpages of [45]. Striving for more
automation, we can use the "raw" action to download the content from the same
makefile that performs the extraction^3. For example, the wiki source of Article
Title [47] is
http://www.mediawiki.org/w/index.php?title=Markup_spec/BNF/Article_title&action=raw.
In order to make our setup stable for the future when the contents of the wiki
page may change (in fact, changing them is one of the main objectives of this
work), we can add the revision number to that command, making it
http://www.mediawiki.org/w/index.php?title=Markup_spec/BNF/Article_title&action=raw&oldid=295042.
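
The URL scheme just described can be captured in a few lines. This is only a sketch of how the two addresses above are assembled; the function name `raw_url` is hypothetical, and the actual download in this project is performed from a makefile, not from Python.

```python
from urllib.parse import urlencode

BASE = "http://www.mediawiki.org/w/index.php"


def raw_url(title, oldid=None):
    """Build the "raw" URL of a wiki page, optionally pinned to a revision."""
    params = {"title": title, "action": "raw"}
    if oldid is not None:
        params["oldid"] = oldid  # pin the revision for reproducibility
    # safe="/" keeps the slashes of subpage titles unescaped.
    return BASE + "?" + urlencode(params, safe="/")
```

Calling `raw_url("Markup_spec/BNF/Article_title", 295042)` yields the pinned address quoted above; omitting the second argument yields the address of the latest revision.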

3.2 Article title 

Parsing Article Title [47] with Grammar Hunter is not hard and does not report
many problems. One particular peculiarity that we notice when comparing the
resulting grammar with the original is the "...?" symbol:



<canonical-page-first-char> ::= <ucase-letter> | <digit> | <underscore> |
<canonical-page-char>       ::= <letter> | <digit> | <underscore> | ...?



^3 Available as Makefile.

The "...?" symbol is not explained anywhere, but the intuitive meaning is
that it is a metasymbol for a possible future extension point. For example, if in
the future one decides to allow a hash symbol (#) in an article title (currently
not allowed for technical reasons), it will be added as an alternative to the
production defining canonical-page-char. The very notion of such extension
points contradicts the contemporary view on language evolution. It is commonly
assumed that a grammar engineer cannot predict in advance all the places in the
grammar that will need change in the future: hence, it is better not to mark any
such places explicitly and to assume that any place can be extended, replaced,
adapted, transformed, etc. Modern grammar transformation engines such as
XBGF [74], Rascal [26] or TXL [13] all have means of extending a grammar in
almost any desired place. Since it seems reasonable to remove these extension
points altogether, we can do it with XBGF after the extraction^4:

vertical( in canonical-page-first-char );
removeV(
 canonical-page-first-char:
        "." "." "." "?"
);
horizontal( in canonical-page-first-char );
vertical( in canonical-page-char );
removeV(
 canonical-page-char:
        "." "." "." "?"
);
horizontal( in canonical-page-char );
vertical( in page-first-char );
removeV(
 page-first-char:
        "." "." "." "?"
);
horizontal( in page-first-char );
vertical( in page-char );
removeV(
 page-char:
        "." "." "." "?"
);
horizontal( in page-char );
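
The effect of one such vertical/removeV/horizontal triple can be approximated outside XBGF. The sketch below is not XBGF: the dictionary encoding of the grammar and the function name `remove_alternative` are assumptions made for illustration, and the extension point is kept as a single string rather than as a sequence of terminals. It does mimic one important property of the real operators, namely that a transformation is refused when its applicability precondition does not hold.

```python
def remove_alternative(grammar, nonterminal, alternative):
    """Return a copy of `grammar` without one alternative of `nonterminal`.

    Raises ValueError if the alternative is absent (the transformation would
    be vacuous) or if it is the only one (the nonterminal would become
    undefined).
    """
    alternatives = grammar[nonterminal]
    if alternative not in alternatives:
        raise ValueError(f"{nonterminal} has no alternative {alternative!r}")
    if len(alternatives) == 1:
        raise ValueError(f"cannot remove the only alternative of {nonterminal}")
    result = dict(grammar)  # leave the input grammar untouched
    result[nonterminal] = [a for a in alternatives if a != alternative]
    return result
```

Failing loudly on a vacuous removal is a deliberate choice: a script that silently does nothing would hide the fact that the baseline grammar has changed underneath it.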

By looking at the grammar where this transformation chain does not apply, 
one can notice productions in this style: 



<canonical-article-title> ::= <canonical-page> [<canonical-sub-pages>]
<canonical-sub-pages>     ::= <canonical-sub-page> [<canonical-sub-pages>]
<canonical-sub-page>      ::= <sub-page-separator> <canonical-page-chars>

^4 Part of remove-extension-points.xbgf.

In simple words, what we see here is an optional occurrence of a nonterminal
called canonical-sub-pages, which is defined as a list of one or more
nonterminals called canonical-sub-page. So, in fact, that optional occurrence
consists of zero or more canonical-sub-page nonterminals. However, these
observations are not immediate when looking at the definition, because the
production is written with explicit right recursion. This style of writing
productions belongs to very early versions of compiler compilers like YACC [21],
which required manual optimisation of each grammar before parser generation
was possible. It has been reported later on multiple occasions [25, 58, etc.]
that it is highly undesirable to perform premature optimisation of a general
purpose grammar for a specific parsing technology that may or may not be used
with it at some point in the future. The classic construct of a list of zero or
more nonterminal occurrences is called a Kleene closure [1] or Kleene star
(since it is commonly denoted as a postfix star) and is omnipresent in modern
grammarware practice. Using the Kleene star makes grammars much more concise
and readable. Most parser generators that require right-recursive (or
left-recursive) expansions of a Kleene star can do them automatically on the
fly. Another possible reason for not using a star repetition could have been to
stay within the limits of pure BNF, but since we have already noted earlier that
this goal was not reached anyway, we see no reason to pretend to seek it. A
well-known grammar beautification technique known as "deyaccification" [58] is
performed by the following grammar refactoring chain^5:

massage(
 canonical-sub-pages?,
 (canonical-sub-pages | ε));
distribute( in canonical-sub-pages );
vertical( in canonical-sub-pages );
deyaccify(canonical-sub-pages);
inline(canonical-sub-pages);
massage(
 (canonical-sub-page+ | ε),
 canonical-sub-page*);
massage(
 canonical-page-chars?,
 (canonical-page-chars | ε));
distribute( in canonical-page-chars );
vertical( in canonical-page-chars );
deyaccify(canonical-page-chars);
inline(canonical-page-chars);
massage(
 (canonical-page-char+ | ε),
 canonical-page-char*);
massage(
 sub-pages?,
 (sub-pages | ε));
distribute( in sub-pages );
vertical( in sub-pages );
deyaccify(sub-pages);
inline(sub-pages);
massage(
 (sub-page+ | ε),
 sub-page*);
massage(
 page-chars?,
 (page-chars | ε));
distribute( in page-chars );
vertical( in page-chars );
deyaccify(page-chars);
inline(page-chars);
massage(
 (page-char+ | ε),
 page-char*);



^5 Part of deyaccify.xbgf.

Even the simplest metrics can show us that these refactorings have simplified
the grammar, reducing it from 15 VAR and 25 PROD to 11 VAR and 17
PROD [55], without any loss of functionality. They have also removed
technological idiosyncrasies and improved properties that are somewhat harder
to measure, like readability and understandability.
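
The core step of the chain above, and the way the size metrics react to it, can be sketched as follows. This is not the XBGF deyaccify operator: the encoding of a grammar as a dictionary of tuples, the trick of writing the repetition as the symbol name followed by "+", and the reading of PROD as "number of alternatives" are all assumptions of this sketch.

```python
def deyaccify(grammar, n):
    """Rewrite  n: x | x n  (or  n: x | n x)  into  n: x+ ."""
    alts = set(grammar[n])
    for alt in alts:
        if len(alt) == 1:
            x = alt[0]
            if alts == {(x,), (x, n)} or alts == {(x,), (n, x)}:
                result = dict(grammar)
                # Assumption: the plus repetition is encoded in the name.
                result[n] = [(x + "+",)]
                return result
    raise ValueError(f"{n} is not a yaccified list")


def metrics(grammar):
    """VAR = nonterminals defined, PROD = alternatives (our assumption)."""
    return len(grammar), sum(len(alts) for alts in grammar.values())
```

For a two-nonterminal toy grammar with a right-recursive list, `metrics` drops from (2, 3) to (2, 2) after `deyaccify`; inlining the list nonterminal, as the chain above does, would then reduce VAR as well.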

3.3 Article 

Article [46] contains seven grammar fragments, out of which only the first three 
conform to the chosen grammar notation. The last four were copy-pasted from 
elsewhere and use a different EBNF dialect, which we luckily can also analyse 
and identify: 



Name                            Value
Defining symbol                 =
Definition separator symbol     |
Start special symbol            ?
End special symbol              ?
Start terminal symbol           "
End terminal symbol             "
Start option symbol             [
End option symbol               ]
Start group symbol              (
End group symbol                )
Start repetition star symbol    {
End repetition star symbol      }
Exception symbol                -



We will not lay out its step by step comparison with the notation used in
the rest of the MediaWiki grammar, but it suffices to say that the presence of
the exception symbol in the metalanguage is enough to make some grammars
inexpressible in a metalanguage without it. BGF does not have a metasymbol
for exception, but we still could express the dialect in EDD^6 and extract these
parts of the grammar with it. Judging by the presence of the Kleene star in
the metalanguage, the grammar engineers who developed those parts did not
intend to stay within BNF limits. Thus, we can also advise adding a plus
repetition for denoting a sequence of one or more nonterminal occurrences,
in order to improve readability of productions like these:

Line = PlainText { PlainText } { " " { " " } PlainText { PlainText } } ;
Text = Line { Line } { NewLine { NewLine } Line { Line } } ;



Or, in postfix-oriented BNF that we use within SLPS: 



Line:
        PlainText PlainText* (" " " "* PlainText PlainText*)*
Text:
        Line Line* (NewLine NewLine* Line Line*)*



^6 Available at metawiki.edd.

Compare with the version that we claim to be more readable:

Line:
        PlainText+ (" "+ PlainText+)*
Text:
        Line+ (NewLine+ Line+)*



In fact, many modern grammar definition formalisms have a metaconstruct
called "separator list", because Text above is nothing more than a (multiple)
NewLine-separated list of Lines. We do not enforce this kind of metaconstruct
here, but we do emphasize the fact that the very understanding of Text being a
separated list of Lines was not clear before our proposed refactoring. In case
MediaWiki still wants the grammar representation to have only one type of
repetition or even no repetition at all, such a view can be automatically derived
from the baseline grammar preserved in a more expressive metalanguage. The
refactorings that utilise the plus notation are rather straightforward^7:

massage(
 PlainText PlainText*,
 PlainText+);
massage(
 Line Line*,
 Line+);
massage(
 NewLine NewLine*,
 NewLine+);
massage(
 " " " "*,
 " "+);
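
That these massages preserve the language can be checked empirically for the Line production, since it contains no recursion and is therefore expressible as a regular expression. The sketch below is only a bounded sanity check, not a proof; the one-character encoding of symbols (P for PlainText, S for the " " terminal) and the function name are assumptions of this sketch.

```python
import re
from itertools import product

# One character stands for one symbol: P = PlainText, S = the " " terminal.
# PlainText PlainText* (" " " "* PlainText PlainText*)*
BEFORE = re.compile(r"PP*(SS*PP*)*")
# PlainText+ (" "+ PlainText+)*
AFTER = re.compile(r"P+(S+P+)*")


def same_language(max_len=8):
    """Compare both forms on every symbol sequence up to `max_len` symbols."""
    for n in range(max_len + 1):
        for symbols in product("PS", repeat=n):
            s = "".join(symbols)
            if bool(BEFORE.fullmatch(s)) != bool(AFTER.fullmatch(s)):
                return False
    return True
```

The second pattern also makes the "separator list" reading visible: one or more P groups, separated by one or more S symbols.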

Further investigation draws our attention to these productions: 



PageName = TitleCharacter , { [ " " ] TitleCharacter } ;
PageNameLink = TitleCharacter , { [ " " | "_" ] TitleCharacter }



The comma used in both productions is not a terminal symbol ",": in fact,
it is a concatenate symbol from ISO EBNF [20]. Since ISO EBNF is not the
notation used, the commas must have been left in unintentionally; this is
what usually happens when grammars are transformed manually and not in a
disciplined way. Grammar Hunter assumed that the quotes were forgotten in
this place (since a comma is not a good name for a nonterminal), so we need
to project it away (the corresponding operator is called abstractize because
it shifts a grammar from concrete syntax to abstract syntax). These are the
transformations that we write down^8:

abstractize(
 PageName:
        TitleCharacter (",") (" "? TitleCharacter)*
);
abstractize(
 PageNameLink:
        TitleCharacter (",") ((" " | "_")? TitleCharacter)*
);

^7 Part of utilise-repetition.xbgf.
^8 Complete listing of remove-concatenation.xbgf.

The following fragment uses excessive bracketing: parentheses are used to
group symbols together, which is usually necessary for inner choices and similar
cases when one needs to override natural priorities. However, in this case it is
unnecessary:



SectionTitle = ( SectionLinkCharacter - "=" )
               { [ " " ] ( SectionLinkCharacter - "=" ) }
LinkTitle = { UnicodeCharacter { " " } } ( UnicodeCharacter - "]" ) ;



Excessive bracketing is not a problem for the SLPS toolset, since all BGF
grammars are normalised before serialisation and the normalisation includes a
step of refactoring trivial subsequences, but we still report it for the sake
of reproducibility within a different environment.

The following grammar production uses a strange-looking construction that 
is explained in the text to be the "non-greedy" variant of the optional newline: 



<special-block-and-more> ::=
    <special-block> ( EOF | [<newline>] <special-block-and-more>
                    | (<newline> | "") <paragraph-and-more> )



The purpose of a syntax definition such as a BNF is to define the syntax of a
language. Thus, any references to the semantics of the parsing process should
be avoided. The definition of "greediness" as ordered alternatives, given on the
first page of [45], contradicts the classic definition based on token consumption,
and contradicts the basics of EBNF. Approaches alternative to context-free
grammars such as PEG [17] should be considered if ordered alternatives are
really required. For EBNF (or BGF), we refactor the singularity as follows^9:

massage(
 (newline | ε),
 newline?);

Since at this point the subgrammar of this part must be rather consistent, we
can execute some simple grammar analyses to help assess the grammar quality.
One of them is based on a well-known notion of bottom and top nonterminals
[58, 59]: a top is one that is defined but never used; a bottom is one that
is used but never defined. We were surprised to see WhiteSpaces in the list of
top nonterminals, while Whitespaces was in the list of bottom nonterminals.
Apparently, a renaming is needed^10:

unite (WhiteSpaces, Whitespaces); 
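
The top and bottom analysis is simple enough to sketch directly from its definition. The encoding below, a dictionary from each defined nonterminal to the set of nonterminals it refers to, and the names Article, Text and Line in the usage note are illustrative assumptions of this sketch; SLPS has its own tooling for this analysis.

```python
def tops_and_bottoms(grammar):
    """Return (tops, bottoms) of a grammar.

    `grammar` maps each defined nonterminal to the set of nonterminals that
    occur in its definition (terminals are assumed to be filtered out).
    """
    defined = set(grammar)
    used = set().union(*grammar.values()) if grammar else set()
    tops = sorted(defined - used)      # defined but never used
    bottoms = sorted(used - defined)   # used but never defined
    return tops, bottoms
```

On a toy grammar where Article refers to Text, Text refers to Line and Whitespaces, and both Line and WhiteSpaces are defined without references, the function reports Article and WhiteSpaces as tops and Whitespaces as the only bottom. The start symbol is always expected among the tops; any other entry, especially one that differs from a bottom only in letter case, deserves a closer look.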

The definition of nonterminal BlockHTML contains a textual annotation claiming
that it is not yet referred to. We decided to parse it anyway and validate
that assertion afterwards. Indeed, it showed up as an unconnected grammar
fragment, which we can then safely remove^11:

eliminate(BlockHTML);



^9 Part of utilise-question.xbgf.
^10 Part of unify-whitespace.xbgf.
^11 Part of connect-grammar.xbgf.

3.4 Noparse block

Apart from quoting the language name in the source tags, which makes the start
grammar symbol change from <source lang=bnf> to <source lang="bnf">,
the Noparse Block [51] uses the same EBNF dialect that we derived as the
starting step of our extraction. However, there are two major exceptions:

• Round brackets and square brackets have swapped their meaning. 

• A lookahead assertion metasymbol is used, borrowed from Perl Compatible 
Regular Expressions library. 

The first impression given by cursory examination of the extracted grammar 
is that it uses excessive bracketing (mentioned in the previous section): 



<pre-block>       ::= <pre-opening-tag> (<whitespace>) <pre-body>
                      (<whitespace>) [<pre-closing-tag> | (?=EOF) ]
<pre-opening-tag> ::= "<pre" (<whitespace> (<characters>)) ">"
<pre-closing-tag> ::= "&lt;/pre" (<whitespace>) ">"
<pre-body>        ::= <characters>



However, if we assume this to be true, the meaning of the grammar
will become inadequate: for example, it will have mandatory whitespace in
many places. On the other hand, making the last part of the grammar
(<nowiki-closing-tag> | (?=EOF)) optional is also inadequate, because an
optional assertion never makes sense. This particular lookahead assertion is
displayed as (?=EOF) and basically means an ε that must be followed by EOF
(even that definition is not that apparent from the low-level description saying
"It asserts that an EOF follows, but does not consume the EOF."). The presence
or absence of lookahead based facilities is heavily dependent on the parsing
technology, and therefore should be avoided as much as possible, as noted by 
multiple sources [25, 36, 58]. More straightforward and high level assertions 
like "should be followed by" and "should not be followed by" are available in 
modern metaprogramming languages like Rascal [26] instead. 
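
To make the behaviour of such an assertion concrete, here is a minimal sketch that uses Python's re module as a stand-in for the PCRE-style notation. The pattern and the helper name closes_at are hypothetical and chosen for illustration only; \Z plays the role of EOF.

```python
import re

# Hypothetical stand-in for the alternative  <pre-closing-tag> | (?=EOF):
# either consume a closing tag, or succeed at end of input without consuming.
closing = re.compile(r"</pre>|(?=\Z)")


def closes_at(text, pos):
    """Return the position right after the closing construct matched at pos,
    or None if neither alternative matches there."""
    m = closing.match(text, pos)
    return None if m is None else m.end()


# Making the same alternative optional defeats the assertion: the optional
# group can always match the empty string, whatever follows.
optional_closing = re.compile(r"(?:</pre>|(?=\Z))?")

print(closes_at("a</pre>b", 1))              # closing tag consumed
print(closes_at("abc", 3))                   # end of input, nothing consumed
print(closes_at("abc", 1))                   # neither alternative applies
print(optional_closing.match("abc", 1).end())
```

The last call succeeds in the middle of the input, which shows why an optional assertion carries no information.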

Since the general problem of tags left open at the end of the article
text is much bigger than the tags described in this part of the grammar, we
opt for removing these assertions altogether and solving the problem later with
suitable technology. EBNF has never been intended for and has never been good
at defining tolerant parsers [27]. Since we have to construct another EBNF
dialect in order to parse the Noparse Block fragment correctly anyway, we specify
"(?=EOF)" as a notation for ε (otherwise we would have to fix the problem later
with a horizontal remove operator from XBGF). Those explicit empty sequence
metasymbols need to be refactored into proper optional symbols:

massage(
  (nowiki-closing-tag | ε),
  nowiki-closing-tag?);
massage(
  (pre-closing-tag | ε),
  pre-closing-tag?);
massage(
  (html-closing-tag | ε),
  html-closing-tag?);

Part of remove-lookahead.xbgf.

In every notation that comprises similar looking symbols and metasymbols
that can be encountered within the same context, there is a need for escaping
some special characters. In this part of the MediaWiki grammar escaping is done
with HTML entities, which cannot be explained with grammar-based arguments.
However, we recall that our extraction source is a handcrafted grammar that
was meant to reproduce the behaviour of the MediaWiki PHP parser, so, in a
sense, it was (manually) extracted, and what we have just encountered is in
fact a legacy artefact inherited from its source. Such legacy should
be removed by the following transformation steps:

renameT("&lt;nowiki", "<nowiki");
renameT("&lt;/nowiki", "</nowiki");
renameT("&lt;pre", "<pre");
renameT("&lt;/pre", "</pre");
renameT("&lt;html", "<html");
renameT("&lt;/html", "</html");
renameT("&lt;!--", "<!--");
replace("&gt;", ">");

There are two more problems in the Noparse Block part that concern the
nonterminal characters. First, it is undefined (bottom). As we will see in §3.8,
there is a nonterminal called character; issues like this, where the definition
of a nonterminal with a readable name is "forgotten", are quite common in
handcrafted grammars, as noted by [28] and other sources. A trivially guessed
definition for characters is either a "one-or-more" or a "zero-or-more"
repetition of character. Since characters is mostly used as an optional
nonterminal, we assume that it is one or more:

define(
  characters:
    character+
);

The second problem is its usage in html-comment (remember that round 
brackets mean optionality here): 



<html-comment> ::= "&lt;!--" ({ characters }) "-->"

Since we do not need to make a Kleene repetition optional, we can refactor 
it as follows^^: 

unfold(characters in html-comment);
massage(
  character+*,
  character*);
massage(
  character*?,
  character*);



Complete listing of dehtmlify.xbgf.
Part of connect-grammar.xbgf.
Part of refactor-repetition.xbgf.
[Figure 1: screenshot of the Links part of the grammar as colour-coded by MediaWiki.]

Figure 1: A syntax that even MediaWiki cannot colour-code properly [52].



massage(
  character*,
  character+?);
fold(characters in html-comment);

More detailed information about leaving combinations of various kinds of 
repetition and optionality in the deployed grammar will be given in the next 
section. 



3.5 Links 

Links definitions [52] exhibit bits of yet another notation, namely the one where 
a set of possible values is given, assuming that only one should be picked. In 
the MediaWiki grammar it is erroneously called a "regex format" — regular ex- 
pressions do use this notation in some places, but not everywhere and it is not 
exclusive to them. This notation is very much akin to "one-of" metaconstructs 
also encountered in definitions of other software languages such as C# [75, 
§3.2.4]. In the MediaWiki grammar, it looks like this: 



/* Specified using regex format, obviously... */
<title-legal-chars> ::= " %!\"$&'()*,\\-.\\/0-9:;=?@A-Z\\\\^_`a-z~\\x80-\\xFF+"



The unobviousness of the notation is perfectly illustrated by the fact that
even the MediaWiki engine itself fails to parse and colour-code it correctly, as
seen in Figure 1. In fact, when we look at the expression more closely, we
notice that it is incorrect in itself, since it uses double-escaping for most
backslashes (ruining them) and does not escape the dot (which denotes any
character when unescaped). Some other characters like * or + should arguably
also be escaped, but it is impossible to decide firmly on escaping rules when we
have no engine to process this string. However, the correct expression should
have looked similar to this:



<title-legal-chars> ::= " %!\"$&'()*,\-\.\/0-9:;=?@A-Z\\^_`a-z~\x80-\xFF+"



Which we rewrite as (some invisible characters are omitted for readability):


<title-legal-chars> ::=
      " " | "%" | "!" | """ | "$" | "&" | "'" | "(" | ")" | "*" | "," | "-" | "." | "/"
    | "0" | "1" | "2" | "3" | "4" | "5" | "6" | "7" | "8" | "9" | ":" | ";" | "=" | "?" | "@"
    | "A" | "B" | "C" | "D" | "E" | "F" | "G" | "H" | "I" | "J" | "K" | "L" | "M"
    | "N" | "O" | "P" | "Q" | "R" | "S" | "T" | "U" | "V" | "W" | "X" | "Y" | "Z"
    | "\" | "^" | "_" | "`"
    | "a" | "b" | "c" | "d" | "e" | "f" | "g" | "h" | "i" | "j" | "k" | "l" | "m"
    | "n" | "o" | "p" | "q" | "r" | "s" | "t" | "u" | "v" | "w" | "x" | "y" | "z"
    | "~" | "¡" | "¢" | "£" | "¤" | "¥" | "¦" | "§" | "¨" | "©" | "ª" | "«" | "¬"
    | "®" | "¯" | "°" | "±" | "²" | "³" | "´" | "µ" | "¶" | "·" | "¸" | "¹" | "º"
    | "»" | "¼" | "½" | "¾" | "¿" | "À" | "Á" | "Â" | "Ã" | "Ä" | "Å" | "Æ" | "Ç"
    | "È" | "É" | "Ê" | "Ë" | "Ì" | "Í" | "Î" | "Ï" | "Ð" | "Ñ" | "Ò" | "Ó" | "Ô"
    | "Õ" | "Ö" | "×" | "Ø" | "Ù" | "Ú" | "Û" | "Ü" | "Ý" | "Þ" | "ß" | "à" | "á"
    | "â" | "ã" | "ä" | "å" | "æ" | "ç" | "è" | "é" | "ê" | "ë" | "ì" | "í" | "î"
    | "ï" | "ð" | "ñ" | "ò" | "ó" | "ô" | "õ" | "ö" | "÷" | "ø" | "ù" | "ú" | "û"
    | "ü" | "ý" | "þ" | "ÿ" | "+"



This refactored version with all alternatives given explicitly was created au- 
tomatically by a trivial Python one-liner and can be parsed without any trouble 
by Grammar Hunter. We should also note that the name for this nonterminal 
is misleading, since it represents only one character. This is not a technical 
mistake, but we can improve learnability of the grammar by fixing it^*': 

renameN (title-legal-chars, title-legal-char) ; 
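
The expansion of a character class into explicit alternatives, mentioned above as a trivial Python one-liner, could look roughly as follows. This is a sketch rather than the script that was actually used: the function names are hypothetical, and the class body is assumed to contain only literal characters and ranges, with no escapes.

```python
def expand_class(spec):
    """Expand a class body such as "A-Za-z0-9" into a list of characters.
    Assumes no escape sequences; "x-y" denotes an inclusive range."""
    chars = []
    i = 0
    while i < len(spec):
        if i + 2 < len(spec) and spec[i + 1] == "-":
            lo, hi = ord(spec[i]), ord(spec[i + 2])
            chars.extend(chr(c) for c in range(lo, hi + 1))
            i += 3
        else:
            chars.append(spec[i])
            i += 1
    return chars


def as_alternatives(chars):
    """Render characters as quoted terminals separated by choice bars."""
    return " | ".join('"%s"' % c for c in chars)


print("<harmless-character> ::= "
      + as_alternatives(expand_class("A-Za-z0-9")))
```

Handling the escaped characters of the real expression would require a small tokeniser in front of expand_class, which is omitted here.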

Grammar Hunter displays an error message but is capable of dealing with 
this fragment: 



<article-link> ::= [<interwiki-prefix> | ":" ] [<namespace-prefix] <article-title>



The problem in this grammar production is in "[<namespace-prefix]"
(note the unbalanced angle brackets). The start nonterminal symbol here is
followed by the name of the nonterminal and then by the end option symbol,
without the end nonterminal symbol. Problems of this kind are rather common
in grammars that have been created manually and have never been tested
in any environment that would make them executable or otherwise validate
their consistency. Grammar Hunter can resolve this problem by using the
heuristic of the next best guess, which is to assume that the nonterminal name
ended at the first alphanumeric/non-alphanumeric border that occurred after
the unbalanced start nonterminal symbol.

Next, consider the following two grammar productions that lead to several 
problems simultaneously: 



<article-title> ::= { [<title-legal-chars> | "%" ] } +
<section-id>    ::= { [<title-legal-chars> | "%" | "#" ] } +

Part of fix-names.xbgf.



As we have explained above, the grammar notation used for the MediaWiki
grammar was never defined explicitly in any formal or informal way, so we had to
infer it in §2. When inferring its semantics, we had two options: to treat the plus
as a postfix metasymbol or to treat "{" and "}+" as bracket metasymbols. Both
variants are feasible, since Grammar Hunter is capable of dealing
with ambiguous starting metasymbols (i.e., "{" as both a start repetition star
symbol and a start repetition plus symbol). We opt for the latter variant:
formal language theory tells us that (x*)+ = x* for any x, so a postfix plus
applied to a star repetition would be useless, and we tend to assume good faith
of the grammar engineers who used it. But even if we assume it to be a
transitive closure (a plus repetition), which is a common notation for a
sequence of one or more occurrences of a subexpression, the productions become
parseable but are bound to cause problems with ambiguities [4] at later stages
of grammar deployment, since in these particular grammar fragments optional
symbols are iterated.

To give a simple example, suppose we have a nonterminal x defined as a+,
and a itself is defined as "a"? (either "a" or ε). Then the following are two
distinct possibilities to parse "aa" with such a grammar:



      a+                  a+
     /  \               /  |  \
    a    a             a   a   a
    ↑    ↑             ↑   ↑   ↑
   "a"  "a"           "a"  ε  "a"



The number of such ways to parse even the simplest of expressions is infinite, 
and special algorithms need to be utilised to detect such problems at the parser 
generator level. Thus, to prevent this trouble from happening, we massage the 
productions above to use a simple star repetition instead, which is an equivalent 
unambiguous construct ^^: 

massage(
  (title-legal-chars | "%")?+,
  (title-legal-chars | "%")*);
massage(
  (title-legal-chars | "%" | "#")?+,
  (title-legal-chars | "%" | "#")*);
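
The growth in the number of parses for the toy grammar above (x defined as a+, a defined as "a"?) can be checked by brute force. The sketch below is illustrative only; the function name is hypothetical. It counts, for a fixed number k of occurrences of a, how many assignments of "a" or ε to those occurrences spell the input.

```python
from itertools import product


def count_parses(word, k):
    """Count the ways in which k occurrences of  a ::= "a" | ε  spell `word`.
    Each occurrence independently yields either "a" or the empty string."""
    return sum(1 for choice in product(("a", ""), repeat=k)
               if "".join(choice) == word)


# One count per number of iterations of the plus repetition.
counts = [count_parses("aa", k) for k in range(2, 7)]
print(counts)
```

Since k is unbounded in a plus repetition, the counts keep growing, so the total number of parse trees for "aa" is infinite, whereas the massaged star repetition admits exactly one.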

Reading further, we notice one of the nonterminals being defined with ex- 
plicit right recursion: 

<extra-description> ::= <letter> [<extra-description>] 



The problem is known and has been discussed above, all we need here is 
proper deyaccification^**: 

massage(
  extra-description?,
  (extra-description | ε)
  in extra-description);
distribute( in extra-description );
vertical( in extra-description );
deyaccify(extra-description);

Part of utilise-repetition.xbgf.
Part of deyaccify.xbgf.
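
The net effect of the steps above can be illustrated with a small Python sketch. It is a one-step shortcut under simplifying assumptions, not the XBGF operators themselves: the list-and-tuple representation of productions and both function names are hypothetical.

```python
def deyaccify(name, alternatives):
    """Collapse  name ::= x [name]  into  name ::= x+.

    `alternatives` is a list of right-hand sides; each is a list of symbols,
    and an optional symbol s is written as the tuple ("opt", s).
    Anything that does not match the pattern is returned unchanged."""
    if len(alternatives) == 1:
        rhs = alternatives[0]
        if len(rhs) == 2 and rhs[0] != name and rhs[1] == ("opt", name):
            return [[("plus", rhs[0])]]
    return alternatives


def unroll(x, depth):
    """Strings derivable from  N ::= x [N]  with at most `depth` uses of N;
    the result is one to `depth` copies of x, the same language as x+."""
    return {x * n for n in range(1, depth + 1)}


before = [["letter", ("opt", "extra-description")]]
print(deyaccify("extra-description", before))
print(sorted(unroll("l", 3)))
```

The explicit massage, distribute and vertical steps in the listing exist because the real operator expects the recursion in a particular shape; the sketch skips that normalisation.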

The last problem with the Links part of the grammar is the use of natural 
language inside a BNF production: 



<protocol> ::= ALLOWED_PROTOCOL_FROM_CONFIG (e.g. "http://", "mailto:")



Examples are never a part of a syntax definition: the alternatives are either 
listed exhaustively (like we will do later when we make the grammar complete) 
or belong in the comments (like it was undoubtedly intended here). A projection 
is needed to remove them from the raw extracted grammar^^: 

project(
  protocol:
    ALLOWED_PROTOCOL_FROM_CONFIG ⟨(e "." g "." "http://" "," "mailto:")⟩
);



3.6 Magic links 

Just like Noparse Block discussed above in §3.4, Magic Links [49] also uses
<source lang="bnf"> as the start grammar symbol, but this is the least of the
problems encountered in this fragment. Consider the following productions:

<isbn>        ::= "ISBN" (" " +) <isbn-number> ? (non-word-character /\b/)
<isbn-number> ::= ("97" ("8" | "9") (" " | "-")?) (DIGIT (" " | "-")?)
                  {9} (DIGIT | "X" | "x")



We see a notation where: 

• A postfix plus repetition metasymbol is used, which is not encountered
anywhere else in the MediaWiki grammar.

• The character used as the postfix repetition metasymbol clashes with the end
repetition plus metasymbol from Inline Text [48] and Links [52].

• A postfix optionality metasymbol is used, which is not encountered
anywhere else in the MediaWiki grammar.

• The character used as the postfix optionality metasymbol clashes with the
start special metasymbol and the end special metasymbol from Article [46],
Inline Text [48] and Special Block [50].

• The same character used as the postfix optionality metasymbol is also used
in a prefix notation that relies on lookahead.

• A regular expression is used inside the lookahead assertion.

• A terminal symbol ("9") is not explicitly marked as such.

• A nonterminal symbol ("DIGIT") is not explicitly marked as such.

Complete listing of remove-comments.xbgf.

Indirect clash of "}+" being an end repetition plus symbol as well as a sequence of an end
repetition star symbol and a postfix repetition metasymbol.

In line with the discussion in §3.4, we first remove the lookahead assertions.
They (arguably) do not belong in EBNF at all, and definitely do not belong in
such a form:

project(
  isbn:
    "ISBN" " " "+" isbn-number ⟨("?" non-word-character "/" "\" b "/")⟩
);

We do not even try to add the postfix plus repetition metasymbol to the 
notation definition, since it is used only once, since it clashes with something 
else, and since there is a special nonterminal spaces that should be used instead 
anyway : 

replace(
  " " "+",
  spaces);

Then we adjust the grammar for the untreated postfix question metasymbol:

abstractize(
  isbn-number:
    "97" ("8" | "9") (" " | "-") ⟨"?"⟩ DIGIT (" " | "-") ⟨"?"⟩ "9
    (DIGIT | "X" | "x")
);
widen(
  (" " | "-"),
  (" " | "-")?
  in isbn-number);



3.7 Special block 

Just as in [47], the Special Block uses a special metasymbol for omitted grammar
fragments [50]. This case is subtly different from the one discussed in §3.2 in
the sense that the accompanying text explicitly says that "The dots need to be
filled in". This information is undoubtedly useful, but since its very presence
renders the grammar non-executable, we decide to remove it
from the grammar and let the documentation tell the story of how much of
the intended language the grammar covers:

vertical( in special-block );
removeV(
  special-block:
    "." "." "."
);
horizontal( in special-block );

Part of remove-lookahead.xbgf.
Part of unify-whitespace.xbgf.
Part of utilise-question.xbgf.
Part of remove-extension-points.xbgf.

In the same first production there is an alternative that reads
<nowiki><table></nowiki>, which seems like either a leftover after manually
cleaning up the markup, or a legacy escaping trick. Either way, nowiki wrapping
is not necessary for displaying this fragment and is generally misleading:
the chevrons around "table" are meant to denote it explicitly as a nonterminal,
not as an HTML tag. We project away the unnecessary parts:

vertical( in special-block );
project(
  special-block:
    ⟨nowiki⟩ table ⟨/ nowiki⟩
);
horizontal( in special-block );

There are also more cases of excessive bracketing which are fixed automati- 
cally by Grammar Hunter: 



<defined-term> ::= ";" <text> [ (<definition>)]



A nonterminal symbol called dashes is arguably superfluous and can be 
replaced by a Kleene star of a dash terminal: 



<horizontal-rule> ::= "----" [<dashes>] [<inline-text>] <newline>
<dashes>          ::= "-" [<dashes>]



Still, we can keep it in the grammar for the sake of possible future BNF- 
ification, but refactor the idiosyncrasy (the right recursion) ^^: 

massage(
  dashes?,
  (dashes | ε)
  in dashes);
distribute( in dashes );
vertical( in dashes );
deyaccify(dashes);

The worst part of the Special Block part is the section titled "Tables": 
it contains eight productions in a different notation, with a comment "From 
meta... minor reformatting". This reformatting has obviously been performed 
manually, since it does not utilise the standard notation of the rest of the gram- 
mar, nor is it compatible with the MetaWiki notation that we have encountered 
in §3.3: the defining symbol is from the MediaWiki notation, the terminator 
symbol is from the MetaWiki notation, etc: 



Name                          Value
Defining symbol               ::=
Terminator symbol             ;
Definition separator symbol   |
Start special symbol          ?
End special symbol            ?
Start terminal symbol         "
End terminal symbol           "
Start nonterminal symbol      <
End nonterminal symbol        >
Start option symbol           [
End option symbol             ]

Part of fix-markup.xbgf.
Part of deyaccify.xbgf.



To save the trouble of post-extraction fixing, we used this configuration as
yet another EDD file to extract this grammar fragment and merge it with
the rest of the grammar. The naming convention of the fragment is still not
synchronised with the rest (i.e., camel case vs. dash-separated lowercase), but
we will deal with it later in §5.

We also see a problem similar to the one discussed above in §3.4, namely an 
optional zero-or-more repetition: 



<space-block> ::= <inline-text> <newline> [ {<space-block-2} ]



The solution is also already known to us:

massage(
  space-block-2*?,
  space-block-2*);

When comparing the list of top nonterminals with the list of bottom ones,
we notice TableCellParameters being used while TableCellParameter is
defined. Judging by its clone named TableParameters, the intention was to
name it in the plural, so we perform unification:

unite (TableCellParameter, TableCellParameters) ; 



3.8 Inline text 

Suddenly, [48] uses bulleted-list notation for listing alternatives in a grammar:

<text-with-formatting> ::=
    | <formatting>
    | <inline-html>
    | <noparseblock>
    | <behaviour-switch>
    | <open-guillemet> | <close-guillemet>
    | <html-entity>
    | <html-unsafe-symbol>
    | <text>
    | <random-character>
    | (more missing?)...



Part of refactor-repetition.xbgf.
Part of fix-names.xbgf.

This is almost never encountered in grammar engineering, but it is not
completely unknown to computer science: for example, TLA+ uses this notation
[40]. In our case it is confusing for Grammar Hunter since newlines are also
used in the notation to separate production rules, and since it only happens in
two productions, we decide to manually remove the first bar there. The last line
of the sample above also shows an extension point discussed earlier in §3.2 and
§3.7, which we remove:

vertical( in text-with-formatting );
removeV(
  text-with-formatting:
    (more missing "?") "." "." "."
);
horizontal( in text-with-formatting );

Nonterminal noparseblock is referenced in the same grammar fragment,
but never encountered elsewhere in the grammar; later we will unite it with
noparse-block when specifically considering the enforcement of a consistent
naming convention in §5.5.

The next problematic fragment is the following: 



<html-entity-name> ::= Sanitizer::$wgHtmlEntities (case sensitive)
                       (* "Aacute" | "aacute" | ... *)



It has three problems: 

• Referencing PHP variables from the grammar is unheard of.

• Static semantics given in plain English within postfix parentheses is not
helpful.

• The comment uses "(*" and "*)" as delimiters instead of the "/*" and
"*/" used in the rest of the grammar.

These problems can be solved by projecting away the excessive symbols,
leaving only one nonterminal reference, which will remain undefined for now:

project(
  html-entity-name:
    ⟨(Sanitizer ":" ":" "$")⟩ wgHtmlEntities ⟨(case sensitive (("*" "Aacute")
    | "aacute" | ("." "." "." "*")))⟩
);

Later in §5.6 we will reuse the source code of Sanitizer class to formally 
complete the grammar by defining wgHtmlEntities nonterminal. 

The following fragment combines two double problems that have already 
been encountered before. The first problem is akin to the one we have noticed 
in §3.5, namely having a nonterminal with "-characters" in its name, which is 
supposed to denote only one character taken from a character class; the second 
part of that problem is the usage of the regular expression notation. The second 
problem is an omission/extension point (cf. §3.2 and §3.7), which is expressed 
in Latin: 



<harmless-characters> ::= /[A-Za-z0-9] etc

We rewrite it as follows:

<harmless-characters> ::=
      "A" | "B" | "C" | "D" | "E" | "F" | "G" | "H" | "I" | "J" | "K" | "L" | "M"
    | "N" | "O" | "P" | "Q" | "R" | "S" | "T" | "U" | "V" | "W" | "X" | "Y" | "Z"
    | "a" | "b" | "c" | "d" | "e" | "f" | "g" | "h" | "i" | "j" | "k" | "l" | "m"
    | "n" | "o" | "p" | "q" | "r" | "s" | "t" | "u" | "v" | "w" | "x" | "y" | "z"
    | "0" | "1" | "2" | "3" | "4" | "5" | "6" | "7" | "8" | "9"

Part of remove-extension-points.xbgf.
Complete listing of remove-php-legacy.xbgf.









The name of the nonterminal symbol harmless-characters is misleading, 
since it represents only one character. In fact, simple investigation into top 
and bottom nonterminals [36] shows that it is not referenced anywhere in the 
grammar, but a nonterminal harmless-character is used in the definition of 
text. Hence, we want to unite those two nonterminals^^: 

unite (harmless-characters, harmless-character) ; 

The immediately following production contains a special symbol written in 
the style of ISO EBNF and MetaWiki: 



<random-character> ::= ? any character ?

Instead of adjusting the assumed notation definition, we choose to let Grammar
Hunter parse it as it is, and to subsequently transform the result to a special
BGF metasymbol with the same semantics (i.e., "any character"):

redefine(
  random-character:
    ANY
);

The next problematic fragment once again contains omission/extension 
points: 



<ucase-letter>  ::= "A" | "B" | ... | "Y" | "Z"
<lcase-letter>  ::= "a" | "b" | ... | "y" | "z"
<decimal-digit> ::= "0" | "1" | ... | "8" | "9"



Since in fact they represent all possible alternatives from the given range, 
we rewrite them as follows: 



<ucase-letter>  ::=
      "A" | "B" | "C" | "D" | "E" | "F" | "G" | "H" | "I" | "J" | "K" | "L" | "M"
    | "N" | "O" | "P" | "Q" | "R" | "S" | "T" | "U" | "V" | "W" | "X" | "Y" | "Z"
<lcase-letter>  ::=
      "a" | "b" | "c" | "d" | "e" | "f" | "g" | "h" | "i" | "j" | "k" | "l" | "m"
    | "n" | "o" | "p" | "q" | "r" | "s" | "t" | "u" | "v" | "w" | "x" | "y" | "z"
<decimal-digit> ::=
      "0" | "1" | "2" | "3" | "4" | "5" | "6" | "7" | "8" | "9"









The same looking metasymbol is used later as a pure extension point:

<symbol> ::= <html-unsafe-symbol> | <underscore> | "." | "," | ...

The crucial difference in the semantics of these two metasymbols, both
denoted as "...", lies in the fact that in the former one (i.e., "A" | ... | "Z")
it is basically a macro definition that can be expanded by any human reader,
but in the latter one (i.e., "." | ...) the only thing the reader learns from
looking at it is that something can or should be added. Hence, following the
conclusions we drew above, we expand the former omission metasymbol right
in the grammar source, but we remove the latter omission metasymbol with a
grammar transformation:

vertical( in symbol );
removeV(
  symbol:
    "." "." "."
);
horizontal( in symbol );

Part of fix-names.xbgf.
Part of define-lexicals.xbgf.

Finally, we notice some of the productions using explicit right recursion: 



<newlines>       ::= <newline> [<newlines>]
<space-tabs>     ::= <space-tab> [<space-tabs>]
<spaces>         ::= <space> [<spaces>]
<decimal-number> ::= <decimal-digit> [<decimal-number>]
<hex-number>     ::= <hex-digit> [<hex-number>]

The deyaccifying transformation steps are straightforward:

massage(
  newlines?,
  (newlines | ε)
  in newlines);
distribute( in newlines );
vertical( in newlines );
deyaccify(newlines);
massage(
  space-tabs?,
  (space-tabs | ε)
  in space-tabs);
distribute( in space-tabs );
vertical( in space-tabs );
deyaccify(space-tabs);
massage(
  spaces?,
  (spaces | ε)
  in spaces);
distribute( in spaces );
vertical( in spaces );
deyaccify(spaces);
massage(
  decimal-number?,
  (decimal-number | ε)
  in decimal-number);
distribute( in decimal-number );
vertical( in decimal-number );
deyaccify(decimal-number);
massage(
  hex-number?,
  (hex-number | ε)
  in hex-number);
distribute( in hex-number );
vertical( in hex-number );
deyaccify(hex-number);

Part of remove-extension-points.xbgf.
Part of deyaccify.xbgf.

We should specially note here that the form used to define spaces prevented
us earlier in §3.6 from using less invasive grammar transformation operators.
What we ideally want is a transformation that is as semantics preserving as
possible:

fold(space);
fold(spaces);

It is intentional that these two steps affect the whole grammar. We will
return to this issue later in §5.2.

There are two views given on formatting: an optimistic one and a realistic
one. Since the grammar needs to define the allowed syntax in a structured way,
we scrap the latter:

removeV(
  formatting:
    apostrophe-jungle
);
eliminate(apostrophe-jungle);

The whole section describing Inline HTML was removed from [48] prior to 
extraction because it combines two aspects that are not intended to be defined 
with (E)BNF: it defines a different language embedded inside the current one 
(this can be done in a clean way by using modules in advanced practical frame- 
works like Rascal [26]) and it tries to define rules for automated error fixing (cf. 
fault-tolerant parsing, tolerant parsing, etc). It suffices to note here that the 
metalanguage used in the parts of that section that were formulated not in plain 
English, is fascinatingly different from the parts of the MediaWiki grammar that 
we have already processed: it uses attributed (parametrised) nonterminals and 
postfix modifiers for case (in)sensitivity. The same metasyntax is used in the 
next section about images, so we do need to find a way to process chunks like 
this: 



ImageModeManualThumb ::= mw("img_manualthumb");
ImageModeAutoThumb   ::= mw("img_thumbnail");
ImageModeFrame       ::= mw("img_frame");
ImageModeFrameless   ::= mw("img_frameless");

/* Default settings: */
mw("img_manualthumb") ::= "thumbnail=", ImageName | "thumb=", ImageName
mw("img_thumbnail")   ::= "thumbnail" | "thumb";
mw("img_frame")       ::= "framed" | "enframed" | "frame";
mw("img_frameless")   ::= "frameless";

ImageOtherParameter ::= ImageParamPage | ImageParamUpright | ImageParamBorder
ImageParamPage      ::= mw("img_page")
ImageParamUpgright  ::= mw("img_upright")
ImageParamBorder    ::= mw("img_border")

/* Default settings: */
mw("img_page")    ::= "page=$1" | "page $1" ??? (where is this used?)
mw("img_upright") ::= "upright" [, ["=",] PositiveInteger]
mw("img_border")  ::= "border"

Part of unify-whitespace.xbgf.
Complete listing of remove-duplicates.xbgf.



We try to list the problems within that grammar fragment: 

• Parametrised nonterminals are used in a style of function calls. This 
is not completely uncommon to grammarware since the invention of van 
Wijngaarden grammars [63] and attribute grammars [29], but unnecessary 
here. 

• Some productions end with a terminator symbol ";", others don't. 

• Concatenate metasymbol " , " is used rather inconsistently (occurs between 
some metasymbols, doesn't occur between some nonterminal symbols). 

• Inline comments are given in English without consistent explicit separation 
from the BNF formulae. 



The shortest way to overcome these difficulties is to reformat them lexically, 
unchaining parametrised nonterminals and appending terminator symbols to 
productions that did not have them. The result looks like this: 



ImageModeManualThumb ::= "thumbnail=", ImageName | "thumb=", ImageName ;
ImageModeAutoThumb   ::= "thumbnail" | "thumb";
ImageModeFrame       ::= "framed" | "enframed" | "frame";
ImageModeFrameless   ::= "frameless";

ImageOtherParameter  ::= ImageParamPage | ImageParamUpright | ImageParamBorder
ImageParamPage       ::= "page=$1" | "page $1"; /* ??? (where is this used?) */
ImageParamUpgright   ::= "upright" [, ["=",] PositiveInteger]
ImageParamBorder     ::= "border"



One of the fragments fixed in this way contains postfix metasymbols for case 
insensitivity: 



<behaviourswitch-toc>           ::= "__TOC__"i
<behaviourswitch-forcetoc>      ::= "__FORCETOC__"i
<behaviourswitch-notoc>         ::= "__NOTOC__"i
<behaviourswitch-noeditsection> ::= "__NOEDITSECTION__"i
<behaviourswitch-nogallery>     ::= "__NOGALLERY__"i

These untypical metasymbols are parsed by Grammar Hunter as separate
nonterminals, which we remove by projection:

project(
  behaviourswitch-toc:
    "__TOC__" ⟨i⟩
);
project(
  behaviourswitch-forcetoc:
    "__FORCETOC__" ⟨i⟩
);
project(
  behaviourswitch-notoc:
    "__NOTOC__" ⟨i⟩
);
project(
  behaviourswitch-noeditsection:
    "__NOEDITSECTION__" ⟨i⟩
);
project(
  behaviourswitch-nogallery:
    "__NOGALLERY__" ⟨i⟩
);

There is also a mistake that is easily overlooked unless you analyse top and 
bottom nonterminals (look at the second option): 



ImageAlignParameter ::= ImageAlignLeft | ImageAlign|Center |
                        ImageAlignRight | ImageAlignNone



This extra unnecessary bar is parsed as a regular choice separator, so we
need to fix it this way^38:

replace(
 (ImageAlign | Center),
 (ImageAlignCenter));

The same analysis shows us a fragment in the resulting grammar, which is
unconnected because ImageOption does not list it with the others^39:

vertical( in image-option );
addV(
 image-option:
        image-other-parameter
);
horizontal( in image-option );

The first and the last productions of the Images subsection contain an ex- 
plicitly marked nonterminal symbol: 



ImageInline ::= "[[" , "Image:" , PageName, ".",
                ImageExtension, ( { <Pipe>, ImageOption, } ) "]]" ;
Caption ::= <inline-text>

A production in the middle of the Images subsection and the first production
of the Media subsection make inconsistent use of a concatenate symbol:

ImageSizeParameter ::= PositiveNumber "px" ;

MediaInline ::= "[[" , "Media:" , PageName "." MediaExtension "]]"

^37 Complete listing of remove-postfix-case.xbgf.
^38 Part of fix-names.xbgf.
^39 Part of connect-grammar.xbgf.



And, finally, the last production of the Media subsection contains a wrong 
defining symbol: 



MediaExtension = "ogg" | "wav"



These three problems were reported and overcome by Grammar Hunter but
not solved automatically, because usually there is more than one way to resolve
such issues, and human intervention is needed to make a choice. After the
unified notation is enforced everywhere, we can extract the grammar and
continue recovering it with grammar transformation steps. It should be noted that
Grammar Hunter could not resolve the lack of concatenate symbols, since it
starts by assuming that the following symbol is a part of the current one
(originally the concatenate symbol was proposed in [20] in order to allow
nonterminal names to contain spaces), but it easily dealt with excessive
concatenate symbols because they just virtually insert ε here and there, which
gets easily normalised.

Back to the rest of the section, we have a fragment with essentially an 
extension point specified in plain English as the right hand side of a production: 



GalleryImage ::= (to be defined: essentially foo.jpg[| caption] )

We can easily decide to disregard this definition in favour of a really working
one^40:

redefine(
 GalleryImage:
        ImageName ("|" Caption)?
);

After analysing top and bottom nonterminals, we easily spot
unespaced-less-than being bottom and unescaped-less-than being
top; apparently, they were meant to be one, and the other one is a misspelled
variation of the kind typically found in big handcrafted grammars. The same
issue arises with some other nonterminals; apparently this grammar fragment
was typed by someone rather careless at spelling^41:

unite(unespaced-less-than, unescaped-less-than);
unite(ImageParamUpgright, ImageParamUpright);
unite(ImageValignParameter, ImageVAlignParameter);
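The manual spotting of misspelled pairs described above can be partly mechanised. The following is a minimal sketch, not the procedure used in this project: it assumes the sets of top and bottom nonterminals are already known, and it uses Python's standard difflib module to propose pairs whose names look alike. The function name likely_misspellings, the cutoff value and the sample sets are hypothetical choices for illustration.

```python
import difflib


def likely_misspellings(tops, bottoms, cutoff=0.85):
    """Pair each bottom nonterminal with the most similar top nonterminal, if any."""
    pairs = []
    candidates = sorted(tops)
    for bottom in sorted(bottoms):
        # get_close_matches returns at most n names whose similarity ratio >= cutoff
        match = difflib.get_close_matches(bottom, candidates, n=1, cutoff=cutoff)
        if match:
            pairs.append((bottom, match[0]))
    return pairs


# Hypothetical sample: two misspelled pairs plus unrelated names.
tops = {"unescaped-less-than", "ImageParamUpgright", "wiki-page"}
bottoms = {"unespaced-less-than", "ImageParamUpright", "math-block"}

for bottom, top in likely_misspellings(tops, bottoms):
    print(f"candidate pair: {bottom} / {top}")
```

A human still has to decide which spelling survives, since the sketch only proposes candidates and cannot tell a typo from two genuinely distinct names.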



3.9 Fundamental elements 

Surprisingly for those who did not look at the text of the Inline Text part, 
the Fundamental Elements [44] does not contain any new grammar productions 
for us, because all of them were encountered within the Inline Text, slightly 
reordered. 

^40 Part of remove-extension-points.xbgf.
^41 Part of fix-names.xbgf.
4 Conclusion

This section contains the list of imperfections found in the MediaWiki grammar
definition. In parentheses we refer to the section in the text that unveils the
problem or explains it.

• Non-extended Backus-Naur form was claimed to be used (§2) 

• Three different metalanguages used for parts of the grammar (§3.3, §3.4, 
§3.7) 

• Bulleted-list notation for alternatives is used, both untraditional and
inconsistent with other grammar fragments (§3.8)

• Atypical metasymbols used:

    - "...?" (§3.2): not defined, assumed to be an extension point
    - "(?=EOF)" (§3.4): defined in terms of lookahead symbols
    - "(" and ")" (§3.4): unexpectedly used to denote optionality
    - "[" and "]" (§3.4): unexpectedly used for grouping
    - "+" (§3.6): not defined, assumed to be a plus repetition
    - "?" (§3.6): not defined, assumed to be a postfix optionality
    - "?()" (§3.6): not defined, assumed to be a lookahead assertion
    - "..." (§3.7, §3.8): omissions due to the lack of knowledge
    - "..." (§3.8): omissions to denote values from the range of alternatives
    - "(*" and "*)" (§3.8): start and end comment symbols

• An undesirable omission/extension point metasymbol was used (§3.2, §3.7, 
§3.8) 

• An undesirable exception metasymbol was used (§3.3) 

• An attempt to use metasyntax to distinguish between two choice semantics 
(§3.3) 

• "Yaccified" productions with explicit right recursion (§3.2, §3.8) 

• Underused metalanguage functionality: obfuscated "plus" repetitions and 
separator lists (§3.3) 

• Misspelled nonterminal names w.r.t. case: WhiteSpaces vs. Whitespaces
(§3.3), InlineText vs. inline-text (§3.7, §3.8), etc

• Mistyped nonterminal names: unespaced-less-than
vs. unescaped-less-than and ImageParamUpgright vs.
ImageParamUpright (§3.8)






• Varying grammar fragment delimiters (§3.4, §3.6) 

• Not marking terminals explicitly with the chosen notation (§3.6) 

• Not marking nonterminals explicitly with the chosen notation (§3.6) 

• Escaping special characters with HTML entities (§3.4) 

• Usage of "regexp format" to specify title legal characters (§3.5) 

• Insufficient and excessive escaping within "regexp format" (§3.5)

• Misleading nonterminal symbol name: plural name for a single character 
(§3.5, §3.8) 

• Improper omission of the end nonterminal metasymbol (§3.5) 

• Natural language (examples given in parentheses) as a part of a BNF
production (§3.5)

• Inherently ambiguous constructs like a?+ and a*? (§3.4, §3.5, §3.7)

• Excessive bracketing (§3.3, §3.7) 

• Unintentionally undefined nonterminals (§3.4) 

• Referencing PHP variables like Sanitizer::$wgHtmlEntities and
configuration functions like mw("img_thumbnail") (§3.8, §5.6)

5 Finishing touches 

Table 1 shows the progress of several grammar metrics during recovery: TERM
is the number of unique terminal symbols used in the grammar, VAR is the
number of nonterminals defined or referenced there, and PROD is the number of
grammar production rules (counting each top alternative in them) [31]. We
have already discussed bottom and top nonterminals from [36, 58, 59] earlier in
§3.3. It is known and intuitively understood that high numbers of top and
bottom nonterminals indicate an unconnected grammar. In the ideal grammar, only
a few top nonterminals exist (preferably just one, which is the start symbol) and
only a few bottoms (only those that need to be defined elsewhere, lexically or in
another language) [36]. Thus, our finishing touches mostly involved inspection
of the tops and bottoms and their elimination. The very last step, called
"subgrammar" in Table 1, extracted only the desired start symbol (wiki-page) and
all nonterminals reachable from its definition.

Using the terminology of [36], in this section we move from a level 1 grammar
(i.e., a raw extracted one) to a level 2 grammar (i.e., a maximally connected one).
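To make the notions of top, bottom and subgrammar concrete, here is a minimal sketch under simplifying assumptions. The grammar is modelled as a Python dictionary from a nonterminal to a list of alternatives, each alternative being a list of symbols, and terminals are recognised by a leading double quote. This representation, the function names and the toy grammar are all hypothetical and are not the BGF format or the tooling used in this project. A top is taken to be a nonterminal that is defined but never used, a bottom one that is used but never defined.

```python
def is_terminal(symbol):
    """Terminals are written with surrounding double quotes in this sketch."""
    return symbol.startswith('"')


def used_nonterminals(grammar):
    """All nonterminals occurring in some right-hand side."""
    return {s for alts in grammar.values() for alt in alts for s in alt
            if not is_terminal(s)}


def tops(grammar):
    """Defined but never used."""
    return set(grammar) - used_nonterminals(grammar)


def bottoms(grammar):
    """Used but never defined."""
    return used_nonterminals(grammar) - set(grammar)


def subgrammar(grammar, start):
    """Keep only the definitions reachable from the start symbol."""
    reachable, todo = set(), [start]
    while todo:
        nonterminal = todo.pop()
        if nonterminal in reachable or nonterminal not in grammar:
            continue
        reachable.add(nonterminal)
        for alt in grammar[nonterminal]:
            todo.extend(s for s in alt if not is_terminal(s))
    return {n: alts for n, alts in grammar.items() if n in reachable}


# Toy grammar, invented for illustration only.
toy = {
    "wiki-page": [["paragraph"], ["heading"]],
    "paragraph": [["inline-text", "newline"]],
    "heading": [['"=="', "inline-text", '"=="']],
    "link": [['"[["', "page-name", '"]]"']],
}

print(sorted(tops(toy)), sorted(bottoms(toy)))
print(sorted(subgrammar(toy, "wiki-page")))
```

In the toy grammar the unused definition of link disappears once the subgrammar of wiki-page is taken, which mirrors the drop in top nonterminals in the last step of the recovery.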
                                         TERM   VAR   PROD   Bottom   Top
After extraction                          304   188    691       78    29
After utilise-repetition.xbgf             304   188    691       78    29
After remove-concatenation.xbgf           304   188    691       78    29
After remove-extension-points.xbgf        304   188    684       73    29
After remove-php-legacy.xbgf              302   188    684       70    29
After deyaccify.xbgf                      302   187    680       70    29
After remove-comments.xbgf                300   187    680       68    29
After remove-lookahead.xbgf               300   184    680       66    29
After remove-duplicates.xbgf              300   183    678       66    29
After dehtmlify.xbgf                      299   183    678       66    29
After utilise-question.xbgf               299   183    678       66    29
After fix-markup.xbgf                     299   183    678       64    29
After define-special-symbols.xbgf         299   183    678       62    29
After fake-exclusion.xbgf                 299   183    678       58    26
After remove-postfix-case.xbgf            299   183    678       57    26
After fix-names.xbgf                      307   182    681       37    14
After unify-whitespace.xbgf               307   181    681       31    13
After connect-grammar.xbgf                307   181    671       16     7
After refactor-repetition.xbgf            307   181    671       16     7
After define-lexicals.xbgf                310   187    671        9     7
After subgrammar                          310   177    664        8     1

Table 1: Simple metrics computed on grammars during transformation.
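The three size metrics of Table 1 are simple counts. The sketch below shows one way to compute them, assuming a hypothetical in-memory representation (a dictionary from nonterminal to a list of alternatives, each a list of symbols, with terminals written in double quotes). It illustrates the definitions given in the text and is not the tool that produced the numbers in the table.

```python
def metrics(grammar):
    """Compute TERM, VAR and PROD for a grammar given as {nonterminal: [[symbol, ...], ...]}."""
    symbols = [s for alts in grammar.values() for alt in alts for s in alt]
    terminals = {s for s in symbols if s.startswith('"')}
    referenced = {s for s in symbols if not s.startswith('"')}
    return {
        "TERM": len(terminals),                                   # unique terminal symbols
        "VAR": len(set(grammar) | referenced),                    # defined or referenced nonterminals
        "PROD": sum(len(alts) for alts in grammar.values()),      # each top alternative counts once
    }


# Toy grammar, invented for illustration only.
toy_grammar = {
    "wiki-page": [["paragraph"], ["heading"]],
    "paragraph": [["inline-text", "newline"]],
    "heading": [['"=="', "inline-text", '"=="']],
    "link": [['"[["', "page-name", '"]]"']],
}

print(metrics(toy_grammar))
```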



5.1 Defining special nonterminals 

There is a range of nonterminals used in the MediaWiki grammar that have
noticeably specific names (starting and ending with a question mark or being
uppercased): they are not defined by the grammar, but usually the text around
their definition is enough for a human reader to derive the intended semantics
and then to specify the lacking grammar productions. We also unify the naming
convention while doing so (the final steps of that unification are presented
in §5.5) and leave some nonterminals undefined (bottom) to serve as connection
points to other languages (more on that in §5.6)^42:

vertical( in TableCellParameter );
removeV(
 TableCellParameter:
        ?HTML cell attributes ?
);
addV(
 TableCellParameter:
        html-cell-attributes
);
horizontal( in TableCellParameter );
vertical( in TableParameters );
removeV(
 TableParameters:
        ?HTML table attributes ?
);
addV(
 TableParameters:
        html-table-attributes
);
horizontal( in TableParameters );
define(
 FROM_LANGUAGE_FILE:
        "#redirect"
);
inline(FROM_LANGUAGE_FILE);
define(
 STRING_FROM_DB:
        "Wikipedia"
);
inline(STRING_FROM_DB);
define(
 STRING_FROM_CONFIG:
        STR
);
inline(STRING_FROM_CONFIG);
define(
 NS_CATEGORY:
        "Category"
);
inline(NS_CATEGORY);
define(
 ALLOWED_PROTOCOL_FROM_CONFIG:
        "http://"
        "https://"
        "ftp://"
        "ftps://"
        "mailto:"
);
inline(ALLOWED_PROTOCOL_FROM_CONFIG);
unite(LEGAL_ARTICLE_ENTITY, article-title);

^42 Complete listing of define-special-symbols.xbgf.
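The define-then-inline pattern used repeatedly above can be illustrated with a small sketch. It assumes a hypothetical representation of a grammar as a dictionary from nonterminal to a list of alternatives (lists of symbols), and it only handles definitions with a single alternative, so it would not cover a case like ALLOWED_PROTOCOL_FROM_CONFIG. The functions are simplified stand-ins written for this illustration, not the XBGF operators themselves, and the category-link production is invented.

```python
def define(grammar, name, alternatives):
    """Add a definition for a nonterminal that has none yet."""
    if name in grammar:
        raise ValueError(f"{name} is already defined")
    return {**grammar, name: alternatives}


def inline(grammar, name):
    """Substitute the single right-hand side of `name` for every use, then drop its definition."""
    alternatives = grammar[name]
    if len(alternatives) != 1:
        raise ValueError("this sketch only inlines single-alternative definitions")
    body = alternatives[0]
    return {
        lhs: [[part for sym in alt for part in (body if sym == name else [sym])]
              for alt in alts]
        for lhs, alts in grammar.items()
        if lhs != name
    }


# Hypothetical production that refers to an undefined, uppercased nonterminal.
example = {"category-link": [['"[["', "NS_CATEGORY", '":"', "page-name", '"]]"']]}
example = define(example, "NS_CATEGORY", [['"Category"']])
example = inline(example, "NS_CATEGORY")
print(example)
```

After both steps the special name no longer occurs in the grammar: its value has been pushed into every production that referred to it.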



5.2 Unification of whitespace and lexicals 



Another big metacategory of nonterminal symbols represents the lexical part,
which is not always properly specified by a syntactic grammar. In the
MediaWiki grammar case, there were several attempts to cover all lexical
peculiarities including problems arising from using Unicode (i.e., different
types of spaces and newlines), so the least we can do is to unify those
attempts. Future work on deriving a level 3 grammar from the result of this
project will use test-driven correction to complete the lexical part
correctly [36]. Our current goal is to provide a high quality level 2 grammar
without destroying too much information that can be reused later^43:

unite(?_variants_of_spaces_?, space);
unite(?_carriage_return_and_line_feed_?, newline);
unite(?_carriage_return_?, CR);
unite(?_line_feed_?, LF);
inline(NewLine);
unfold(newline in Whitespaces);
fold(newline in Whitespaces);
unite(?_tab_?, TAB);

^43 Part of unify-whitespace.xbgf.

Another specificity is only referenced but not defined directly by the
grammar. According to the text of the Inline Text section [48], this is a patch
for dealing with French punctuation. It is highly debatable whether such
specificity should be found in the baseline grammar, but since it is not
defined properly anyway, we decide to root it out^44:

vertical( in text-with-formatting );
removeV(
 text-with-formatting:
        open-guillemet
);
removeV(
 text-with-formatting:
        close-guillemet
);
horizontal( in text-with-formatting );

Some bottom lexical nonterminals are trivially defined in BGF^45:

define(
 TAB:
        "\t"
);
define(
 CR:
        "\r"
);
define(
 LF:
        "\n"
);
define(
 any-text:
        unicode-character*
);
define(
 sort-key:
        any-text
);
define(
 any-supported-unicode-character:
        ANY
);



5.3 Connecting the grammar 

The Magic Links part (see §3.6) apparently referenced some nonterminals that
were never defined. We can easily pinpoint them with a simple grammar analysis
showing bottom nonterminals, and after that program the appropriate
transformations^46:

define(
 digits:
        digit+
);
unite(digit, decimal-digit);
unite(DIGIT, decimal-digit);

^44 Part of unify-whitespace.xbgf.
^45 Part of define-lexicals.xbgf.

Undefined nonterminals PositiveInteger and PositiveNumber can both
be merged with this new nonterminal^47:

unite(PositiveInteger, digits);
unite(PositiveNumber, digits);

The nonterminal newlines, defined at [48] and [44], is also never used and can
be eliminated^48:

eliminate(newlines);

The last connecting steps are easy since there are not that many top and bottom
nonterminals left, and a simple human inspection can show that some of them
are actually misspelled pairs like these^49:

unite(ImageModeThumb, image-mode-auto-thumb);
unite(category, category-link);
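The effect of merging two nonterminals on connectedness can be sketched as follows. The function below is a simplified stand-in for the unite operator, written for this illustration under the assumption that a grammar is a dictionary from nonterminal to a list of alternatives (lists of symbols): it renames every occurrence of the first name to the second and pools their alternatives. The production for image-option is hypothetical, and the real XBGF operator may check additional preconditions.

```python
def unite(grammar, old, new):
    """Rename `old` to `new` everywhere and pool the alternatives of both."""
    result = {}
    for lhs, alts in grammar.items():
        target = new if lhs == old else lhs
        bucket = result.setdefault(target, [])
        for alt in alts:
            renamed = [new if sym == old else sym for sym in alt]
            if renamed not in bucket:       # avoid duplicate alternatives
                bucket.append(renamed)
    return result


# Hypothetical fragment: one name is used but undefined, the other defined but unused.
fragment = {
    "image-option": [["ImageModeThumb"]],
    "image-mode-auto-thumb": [['"thumbnail"'], ['"thumb"']],
}
united = unite(fragment, "ImageModeThumb", "image-mode-auto-thumb")
print(united)
```

Before the call, ImageModeThumb is bottom and image-mode-auto-thumb is top; afterwards both problems are gone, which is why a single unite can lower both counts in Table 1 at once.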

In the Links section [52] there is a discussion on whether there should be a
syntactic category for all links (i.e., internal and external). The discussion
seems to be unfinished, with the nonterminal link specified, but unused (i.e.,
top). Since the definition is already available, we decided to use it by folding
wherever possible^50:

fold(link);



5.4 Mark exclusion 

BGF does not have a metaconstruct for exclusion ("a should be parseable as b
but not as c", mostly specified as "<a> ::= <b> - <c>" within the MediaWiki
grammar), but we still want to preserve the information for further
refactoring. One of the ways to do so is to use a marking construct usually
found in parameters to transformation operators such as project or addH^51:

replace(
 ?_all_supported_Unicode_characters_?_-_Whitespaces,
 ((any-supported-unicode-character Whitespaces)));
replace(
 UnicodeCharacter_-_WikiMarkupCharacters,
 ((UnicodeCharacter WikiMarkupCharacters)));
replace(
 SectionLinkCharacter_- "=",
 ((SectionLinkCharacter "=")));
replace(
 UnicodeCharacter_- "]",
 ((UnicodeCharacter "]")));
replace(
 UnicodeCharacter_-_BadTitleCharacters,
 ((UnicodeCharacter BadTitleCharacters)));
replace(
 UnicodeCharacter_-_BadSectionLinkCharacters,
 ((UnicodeCharacter BadSectionLinkCharacters)));

^46 Part of connect-grammar.xbgf.
^47 Part of connect-grammar.xbgf.
^48 Part of connect-grammar.xbgf.
^49 Part of connect-grammar.xbgf.
^50 Part of connect-grammar.xbgf.
^51 Part of fake-exclusion.xbgf.
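The intended meaning of exclusion ("parseable as b but not as c") is easy to state outside BGF. The following sketch expresses it with plain Python predicates over single characters. It only illustrates the semantics that the marking preserves for later refactoring; the helper names are hypothetical, and the two character classes are rough approximations chosen for the example, not the definitions from the MediaWiki grammar.

```python
def exclude(accepts_b, accepts_c):
    """Build a recogniser for "b - c": accepted by b and rejected by c."""
    return lambda text: accepts_b(text) and not accepts_c(text)


def is_unicode_character(text):
    """Approximation: any single character."""
    return len(text) == 1


def is_whitespace(text):
    """Approximation: a single character that Python classifies as whitespace."""
    return len(text) == 1 and text.isspace()


# Corresponds in spirit to "all supported Unicode characters - Whitespaces".
is_non_whitespace_character = exclude(is_unicode_character, is_whitespace)

print(is_non_whitespace_character("a"), is_non_whitespace_character(" "))
```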



5.5 Naming convention 



There are three basic problems with the naming convention if we look at the 
whole extracted grammar, namely: 

Unintelligible nonterminal names. When looking at a particular grammar
production rule situated close to a piece of text explaining all kinds
of details that did not fit in the BNF, it is easy to overlook
non-informative names. In the case of MediaWiki, in the final grammar
we have bottom nonterminals with names like FROM_LANGUAGE_FILE,
STRING_FROM_CONFIG, STRING_FROM_DB. Such names do not belong in the
grammar, because they obfuscate it, and the main reason for having a
grammar printed out in an EBNF-like form in the first place is to make it
readable for a human.

Letter capitalisation. Nonterminal names can be always written in
lowercase, or in uppercase, or in any mixture of them. The choice of parsing
technology can influence that choice: for instance, Rascal [26] can only
process capitalised nonterminal names and ANTLR [54] treats uppercase
nonterminals and non-uppercase ones differently. These implicit
semantic details need to be acknowledged and accounted for in a consistent
manner, which was not the case in the MediaWiki grammar.

Word separation. Most of the nonterminals have names that consist of
several natural words (e.g., "wiki" and "page"). There are several ways to
separate them: by straightforward concatenation ("wikipage"), by
camelcasing ("WikiPage" or "wikiPage"), by hyphenating ("wiki-page"), by
allowing spaces in nonterminal names ("wiki page"), etc. It does not matter
too much which convention is used, as long as it is the same throughout
the whole grammar. In the case of MediaWiki there is no consistency,
which leads not only to decreased readability, but also to problems like
noparse-block being defined in [51] and noparseblock being used in [48]
(they were obviously meant to be one nonterminal).
The complete transformation script enforcing a consistent naming convention
and fixing related problems on the way looks like this^52:

unite(noparseblock, noparse-block);
unite(GalleryBlock, gallery-block);
unite(ImageInline, image-inline);
unite(MediaInline, media-inline);
unite(Table, table);
unite(Text, text);
unite(InlineText, inline-text);
unite(Pipe, pipe);
renameN(AnyText, any-text);
renameN(BadSectionLinkCharacters, bad-section-link-characters);
renameN(BadTitleCharacters, bad-title-characters);
renameN(Caption, caption);
renameN(GalleryImage, gallery-image);
renameN(ImageAlignCenter, image-align-center);
renameN(ImageAlignLeft, image-align-left);
renameN(ImageAlignNone, image-align-none);
renameN(ImageAlignParameter, image-align-parameter);
renameN(ImageAlignRight, image-align-right);
renameN(ImageExtension, image-extension);
renameN(ImageModeAutoThumb, image-mode-auto-thumb);
renameN(ImageModeFrame, image-mode-frame);
renameN(ImageModeFrameless, image-mode-frameless);
renameN(ImageModeManualThumb, image-mode-manual-thumb);
renameN(ImageModeParameter, image-mode-parameter);
renameN(ImageName, image-name);
renameN(ImageOption, image-option);
renameN(ImageOtherParameter, image-other-parameter);
renameN(ImageParamBorder, image-param-border);
renameN(ImageParamPage, image-param-page);
renameN(ImageParamUpright, image-param-upright);
renameN(ImageSizeParameter, image-size-parameter);
renameN(ImageValignBaseline, image-valign-baseline);
renameN(ImageValignBottom, image-valign-bottom);
renameN(ImageValignMiddle, image-valign-middle);
renameN(ImageVAlignParameter, image-valign-parameter);
renameN(ImageValignSub, image-valign-sub);
renameN(ImageValignSuper, image-valign-super);
renameN(ImageValignTextBottom, image-valign-text-bottom);
renameN(ImageValignTextTop, image-valign-text-top);
renameN(ImageValignTop, image-valign-top);
renameN(Line, line);
renameN(LinkTitle, link-title);
renameN(MediaExtension, media-extension);
renameN(PageName, page-name);
renameN(PageNameLink, page-name-link);
renameN(PlainText, plain-text);
renameN(SectionLink, section-link);
renameN(SectionLinkCharacter, section-link-character);
renameN(SectionTitle, section-title);
renameN(TableCellParameters, table-cell-parameters);
renameN(TableColumn, table-column);
renameN(TableColumnLine, table-column-line);
renameN(TableColumnMultiLine, table-column-multiline);
renameN(TableFirstRow, table-first-row);
renameN(TableParameters, table-parameters);
renameN(TableRow, table-row);
renameN(TitleCharacter, title-character);
renameN(UnicodeCharacter, unicode-character);
renameN(UnicodeWiki, unicode-wiki);
renameN(WikiMarkupCharacters, wiki-markup-characters);

^52 Part of fix-names.xbgf.

As one can see, we enforce hyphenation in almost all places, except for
nonterminals inherited from other languages (e.g., blockquote from HTML).
The list of plain renamings was derived automatically by a Python one-liner that
transformed CamelCase to dash-separated names. The XBGF engine always
checks preconditions for renaming a nonterminal (i.e., the target name must be
fresh), so it was then trivial to turn the non-working renameN calls into unite
calls.
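The original one-liner is not reproduced in this report, so the following is only a sketch of how such a script could look. The function names hyphenate and rename_script are hypothetical, and the freshness check is done here up front on a list of known names rather than by the XBGF engine. Note that this particular regular expression turns TableColumnMultiLine into table-column-multi-line, whereas the listing above uses table-column-multiline, so the actual script evidently differed in some details or was adjusted by hand.

```python
import re


def hyphenate(name):
    """Turn CamelCase into dash-separated lowercase, e.g. PageName -> page-name."""
    # Insert a hyphen wherever a lowercase letter or digit is followed by an uppercase letter.
    return re.sub(r"(?<=[a-z0-9])(?=[A-Z])", "-", name).lower()


def rename_script(names):
    """Emit renameN for fresh targets and unite when the target name already exists."""
    existing = set(names)
    lines = []
    for name in sorted(names):
        target = hyphenate(name)
        if target == name:
            continue                      # already follows the convention
        operator = "unite" if target in existing else "renameN"
        lines.append(f"{operator}({name}, {target});")
    return lines


sample = ["InlineText", "inline-text", "PageName", "ImageVAlignParameter"]
for line in rename_script(sample):
    print(line)
```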

5.6 Embedded languages 

We may recall seeing the undefined nonterminal wgHtmlEntities being referenced
in §3.8. There are more like it: in fact, at the end of our recovery project
there are 8 bottom nonterminals in the grammar:



• LEGAL_URL_ENTITY: designates a character that is allowed in a URL;
defined by the corresponding RFC [6].

• inline-html: was removed deliberately due to incompleteness and
questionable representation; defined partially by the accompanying English
text, partially by the HTML standard [57].

• math-block: the syntax used by the math extension to MediaWiki [67].

• CSS: cascading style sheets used to specify layout of tables and table
cells [8].

• html-table-attributes and html-cell-attributes: also layout of
tables and table cells, but in pure HTML.

• wgHtmlEntities: one of the HTML entities ("quot", "dagger", "auml",
etc).



They are all essentially different languages that are reused here, but are not
exactly a part of wiki syntax. Some wiki engines may allow for different subsets
of HTML and CSS features to be used within their pages, but conceptually
these limitations are import parameters, not complete definitions. For instance,
we could derive the missing grammar fragment for wgHtmlEntities by looking at
the file mw_sanitizer.inc from the MediaWiki distribution^53:

<wgHtmlEntities> ::= "Aacute" | "aacute" | "Acirc" | "acirc" | "acute" | "AElig"
 | "aelig" | "Agrave" | "agrave" | "alefsym" | "Alpha" | "alpha" | "amp" | "and"
 | "ang" | "Aring" | "aring" | "asymp" | "Atilde" | "atilde" | "Auml" | "auml"
 | "bdquo" | "Beta" | "beta" | "brvbar" | "bull" | "cap" | "Ccedil" | "ccedil"
 | "cedil" | "cent" | "Chi" | "chi" | "circ" | "clubs" | "cong" | "copy"
 | "crarr" | "cup" | "curren" | "dagger" | "Dagger" | "darr" | "dArr" | "deg"
 | "Delta" | "delta" | "diams" | "divide" | "Eacute" | "eacute" | "Ecirc"
 | "ecirc" | "Egrave" | "egrave" | "empty" | "emsp" | "ensp" | "Epsilon"
 | "epsilon" | "equiv" | "Eta" | "eta" | "ETH" | "eth" | "Euml" | "euml" | "euro"
 | "exist" | "fnof" | "forall" | "frac12" | "frac14" | "frac34" | "frasl"
 | "Gamma" | "gamma" | "ge" | "gt" | "harr" | "hArr" | "hearts" | "hellip"
 | "Iacute" | "iacute" | "Icirc" | "icirc" | "iexcl" | "Igrave" | "igrave"
 | "image" | "infin" | "int" | "Iota" | "iota" | "iquest" | "isin" | "Iuml"
 | "iuml" | "Kappa" | "kappa" | "Lambda" | "lambda" | "lang" | "laquo" | "larr"
 | "lArr" | "lceil" | "ldquo" | "le" | "lfloor" | "lowast" | "loz" | "lrm"
 | "lsaquo" | "lsquo" | "lt" | "macr" | "mdash" | "micro" | "middot" | "minus"
 | "Mu" | "mu" | "nabla" | "nbsp" | "ndash" | "ne" | "ni" | "not" | "notin"
 | "nsub" | "Ntilde" | "ntilde" | "Nu" | "nu" | "Oacute" | "oacute" | "Ocirc"
 | "ocirc" | "OElig" | "oelig" | "Ograve" | "ograve" | "oline" | "Omega"
 | "omega" | "Omicron" | "omicron" | "oplus" | "or" | "ordf" | "ordm" | "Oslash"
 | "oslash" | "Otilde" | "otilde" | "otimes" | "Ouml" | "ouml" | "para" | "part"
 | "permil" | "perp" | "Phi" | "phi" | "Pi" | "pi" | "piv" | "plusmn" | "pound"
 | "prime" | "Prime" | "prod" | "prop" | "Psi" | "psi" | "quot" | "radic"
 | "rang" | "raquo" | "rarr" | "rArr" | "rceil" | "rdquo" | "real" | "reg"
 | "rfloor" | "Rho" | "rho" | "rlm" | "rsaquo" | "rsquo" | "sbquo" | "Scaron"
 | "scaron" | "sdot" | "sect" | "shy" | "Sigma" | "sigma" | "sigmaf" | "sim"
 | "spades" | "sub" | "sube" | "sum" | "sup" | "sup1" | "sup2" | "sup3" | "supe"
 | "szlig" | "Tau" | "tau" | "there4" | "Theta" | "theta" | "thetasym" | "thinsp"
 | "THORN" | "thorn" | "tilde" | "times" | "trade" | "Uacute" | "uacute" | "uarr"
 | "uArr" | "Ucirc" | "ucirc" | "Ugrave" | "ugrave" | "uml" | "upsih" | "Upsilon"
 | "upsilon" | "Uuml" | "uuml" | "weierp" | "Xi" | "xi" | "Yacute" | "yacute"
 | "yen" | "Yuml" | "yuml" | "Zeta" | "zeta" | "zwj" | "zwnj"

^53 Available as mediawiki.config.wiki.



These are 252 entities taken from the DTD of HTML 4.0 [57]. XHTML
1.0 defines an additional entity called "apos" [2], which, technically speaking,
can be handled by MediaWiki since in its current state it rewrites wikitext
to XHTML 1.0 Transitional. Whether it is the grammar's role to report an
error when it is used remains an open question. Furthermore, suppose we are
developing wikiware which is not a WYSIWYG editor, but a migration tool or
an analysis tool: this would mean that the details about all particular entities
are of little importance, and one could define an entity name to be just any
alphanumeric word. Questions like these arise when languages are combined,
and for this particular project we leave the bottom nonterminals that represent
import points undefined.
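The two options discussed above (a strict list of entity names versus any alphanumeric word) can be contrasted in a few lines. The sketch below is an illustration only: instead of the MediaWiki source it uses the name2codepoint table from Python's standard html.entities module, which is documented as mapping HTML 4 entity names to code points, and the function names are hypothetical.

```python
import re
from html.entities import name2codepoint

# Strict variant: the entity names known to the HTML 4 table shipped with Python.
HTML4_ENTITY_NAMES = frozenset(name2codepoint)

# Relaxed variant: an entity name is just any alphanumeric word.
ENTITY_WORD = re.compile(r"[A-Za-z0-9]+")


def is_known_entity(name):
    """Strict check, suitable for a tool that must reject unknown entities."""
    return name in HTML4_ENTITY_NAMES


def is_entity_like(name):
    """Relaxed check, possibly enough for migration or analysis tools."""
    return ENTITY_WORD.fullmatch(name) is not None


for candidate in ("dagger", "apos", "not an entity"):
    print(candidate, is_known_entity(candidate), is_entity_like(candidate))
```

The name "apos" shows the difference: the strict HTML 4 table does not contain it, while the relaxed definition accepts it without any further knowledge of the embedded language.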
6 Results and future work

This document has reported on a successful grammar recovery effort. The
input for this project was a community-created MediaWiki grammar manually
extracted from the PHP tool that is used to transform wiki text to HTML. This
grammar contained unconnected fragments in at least five different notations,
bearing various kinds of errors from conceptual underuse of the base notation to
simple misspellings, rendering the grammar fairly useless. As an output we
provide a level 2 grammar, ready to be connected to adjacent modules (grammars
of HTML, CSS, etc) and made into a higher level grammar (e.g., by testing it on
real wiki code). Naturally, this effort is one step on a long road, and we take
the rest of the report to sketch the next milestones and planned deliverables:

Fix grammar fragments. The first thing we can do is regenerate the original
grammar fragments in the same notation. On one hand, this would help
to not alienate the grammar from its creators; on the other hand, the
fragments will use a consistent notation throughout the grammar and be
validated as not having any misspellings, metasymbol omissions, etc.

Derive several versions. Just in case the same MediaWiki grammar is 
needed in several different notations (e.g., BNF and EBNF), we can de- 
rive them from the baseline grammar with either inferred or programmable 
grammar transformation. 

Propose a better notation. Whether or not the pure BNF grammar is
delivered to the Wikimedia Foundation, it will be of limited use to most people.
The ANTLR notation that Wiki Creole used is more useful, but even less
easy to comprehend. Both more readable and more expressive variants of
grammar definition formalisms exist and can be advised for use based on
the required functionality.

Find ambiguities and other problems. Various grammar analysis tech- 
niques referenced in the text above can be used to perform deeper analyses 
on the grammar in order to make it fully operational in Rascal, resolve ex- 
isting ambiguities, and perhaps even spot problems that are unavoidable 
with the current notation. 

Complete the lexical part. Some lexical definitions were already found in 
the source grammar, and were mostly preserved through the recovery 
process. A level 3 grammar can be derived from our current result by 
reinspecting these definitions together with textual annotations found on 
MediaWiki.org. 